About This Manual


Preface 


Read This First 


This manual is a reference for programming TMS320C6000 digital signal pro- 
cessor (DSP) devices. 


Before you use this book, you should install your code generation and debug- 
ging tools. 


This book is organized in five major parts: 


- Part I: Introduction includes a brief description of the C6000 architecture
  and code development flow. It also includes a tutorial that introduces you
  to the tools you will use in each phase of development and an optimization
  checklist to help you achieve optimal performance from your code.


- Part II: C Code includes C code examples and discusses optimization
  methods for the code. This information can help you choose the most
  appropriate optimization techniques for your code.


- Part III: Assembly Code describes the structure of assembly code. It
  provides examples and discusses optimizations for assembly code. It also
  includes a chapter on interrupt subroutines.


- Part IV: ’C64x Programming Techniques describes programming
  considerations for the ’C64x.


- Part V: Appendix provides a summary of feedback solutions.


Related Documentation From Texas Instruments 




The following books describe the TMS320C6000 devices and related support 
tools. To obtain a copy of any of these TI documents, call the Texas Instru- 
ments Literature Response Center at (800) 477-8924. When ordering, please 
identify the book by its title and literature number. 


TMS320C6000 Assembly Language Tools User’s Guide (literature number 
SPRU186) describes the assembly language tools (assembler, linker, 
and other tools used to develop assembly language code), assembler 
directives, macros, common object file format, and symbolic debugging 
directives for the ’C6000 generation of devices.


TMS320C6000 Optimizing C Compiler User’s Guide (literature number 
SPRU187) describes the C6000 C compiler and the assembly optimizer. 
This C compiler accepts ANSI standard C source code and produces assembly
language source code for the ’C6000 generation of devices. The assembly
optimizer helps you optimize your assembly code.


TMS320C6000 CPU and Instruction Set Reference Guide (literature 
number SPRU189) describes the C6000 CPU architecture, instruction 
set, pipeline, and interrupts for these digital signal processors. 


TMS320C6000 Peripherals Reference Guide (literature number SPRU190) 
describes common peripherals available on the TMS320C6201/6701 
digital signal processors. This book includes information on the internal 
data and program memories, the external memory interface (EMIF), the 
host port interface (HPI), multichannel buffered serial ports (McBSPs), 
direct memory access (DMA), enhanced DMA (EDMA), expansion bus, 
clocking and phase-locked loop (PLL), and the power-down modes. 


TMS320C64x Technical Overview (literature number SPRU395) gives an
introduction to the ’C64x digital signal processor and discusses the
application areas that are enhanced by the ’C64x VelociTI.


TMS320 DSP Designer’s Notebook: Volume 1 (literature number 
SPRT125) presents solutions to common design problems using ’C2x,
’C8x, ’C4x, ’C5x, and other TI DSPs.
Solaris and SunOS are trademarks of Sun Microsystems, Inc.
VelociTI is a trademark of Texas Instruments Incorporated.


Windows and Windows NT are registered trademarks of Microsoft 
Corporation. 
Chapter 1


Introduction 


This chapter introduces some features of the C6000 microprocessor and dis- 
cusses the basic process for creating code and understanding feedback. Any 
reference to C6000 pertains to the ’C62x (fixed-point), ’C64x (fixed-point), and 
the ’C67x (floating-point) devices. Though most of the examples shown are 
fixed-point specific, all techniques are applicable to each device. 
1.1 TMS320C6000 Architecture


The ’C62x is a fixed-point digital signal processor (DSP) and is the first DSP
to use the VelociTI™ architecture. VelociTI is a high-performance, advanced
very-long-instruction-word (VLIW) architecture, making it an excellent choice
for multichannel, multifunctional, and performance-driven applications.


The ’C67x is a floating-point DSP with the same features. It is the second DSP
to use the VelociTI™ architecture.


The ’C64x is a fixed-point DSP with the same features. It is the third DSP to
use the VelociTI™ architecture.


The ’C6000 DSPs are based on the C6000 CPU, which consists of:


- Program fetch unit
- Instruction dispatch unit
- Instruction decode unit
- Two data paths, each with four functional units
- Thirty-two 32-bit registers (’C62x and ’C67x)
- Sixty-four 32-bit registers (’C64x)
- Control registers
- Control logic
- Test, emulation, and interrupt logic
1.2 TMS320C6000 Pipeline


The ’C6000 pipeline has several features that provide optimum performance,
low cost, and simple programming.


- Increased pipelining eliminates traditional architectural bottlenecks in
  program fetch, data access, and multiply operations.

- Pipeline control is simplified by eliminating pipeline locks.

- The pipeline can dispatch eight parallel instructions every cycle.

- Parallel instructions proceed simultaneously through the same pipeline
  phases.
1.3 Code Development Flow To Increase Performance


Traditional development flows in the DSP industry have involved validating a
C model for correctness on a host PC or Unix workstation and then
painstakingly porting that C code to hand-coded DSP assembly language. This
is both time consuming and error prone, and the resulting code is difficult
to maintain over several projects.


The recommended code development flow uses the ’C6000 code generation tools
to aid in optimization rather than forcing the programmer to code by hand in
assembly. The compiler does all the laborious work of instruction selection,
parallelizing, pipelining, and register allocation, which lets the
programmer focus on getting the product to market quickly. Maintenance is
also simpler, because everything resides in a C framework that is simple to
maintain, support, and upgrade.


The recommended code development flow for the C6000 involves the phases
described below. The tutorial section of the Programmer’s Guide focuses on
phases 1-3 and shows when to go to the tuning stage of phase 3. The key
lesson is the importance of giving the compiler enough information to
maximize its potential. An added advantage is that the compiler provides
direct feedback on all of the high-MIPS areas (loops) in the program. Based
on this feedback, the programmer can take some very simple steps to pass
more complete information to the compiler and get a quicker start in
maximizing compiler performance.


You can achieve the best performance from your ’C6000 code if you follow this
code development flow when you are writing and debugging your code:

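[Figure: code development flow in three phases.
 Phase 1, Develop C Code: write C code, compile, and profile. If the code is
 efficient, development is complete; otherwise go on to phase 2.
 Phase 2, Refine C Code: refine C code, compile, and profile. If the code is
 efficient, development is complete; if more C optimization is possible,
 repeat phase 2; otherwise go on to phase 3.
 Phase 3, Write Linear Assembly: write linear assembly and run the assembly
 optimizer. Repeat until the code is efficient, then development is
 complete.]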
The following table lists the phases in the 3-step software development flow
shown above, and the goal for each phase:


Phase  Goal

1      You can develop your C code for phase 1 without any knowledge of
       the ’C6000. Use the ’C6000 profiling tools that are described in the
       Code Composer Studio User's Guide to identify any inefficient areas
       that you might have in your C code. To improve the performance of
       your code, proceed to phase 2.

2      Use techniques described in this book to improve your C code. Use
       the ’C6000 profiling tools to check its performance. If your code is
       still not as efficient as you would like it to be, proceed to phase 3.

3      Extract the time-critical areas from your C code and rewrite the code
       in linear assembly. You can use the assembly optimizer to optimize
       this code.


Because most of the millions of instructions per second (MIPS) in DSP
applications occur in tight loops, it is important for the ’C6000 code
generation tools to make maximal use of all the hardware resources in
important loops. Fortunately, loops inherently have more parallelism than
non-looping code because there are multiple iterations of the same code
executing with limited dependencies between each iteration. Through a
technique called software pipelining, the C6000 code generation tools use
the multiple resources of the VelociTI architecture efficiently and obtain
very high performance.


This chapter shows the code development flow recommended to achieve the 
highest performance on loops and provides a feedback list that can be used 
to optimize loops with references to more detailed documentation. 
Table 1-1, Code Development Steps, describes the recommended code
development flow for developing code which achieves the highest performance
on loops.


Table 1-1. Code Development Steps


Phase  Step  Description

1      1     Compile and profile native C/C++ code
             - Validates original C/C++ code
             - Determines which loops are most important in terms of MIPS
               requirements

2      2     Add restrict qualifier, loop iteration count, memory bank, and
             data alignment information
             - Reduces potential pointer aliasing problems
             - Allows loops with indeterminate iteration counts to execute
               epilogs
             - Uses pragmas to pass count information to the compiler
             - Uses memory bank pragmas and _nassert intrinsic to pass
               memory bank and alignment information to the compiler

       3     Optimize C code using other C6000 intrinsics and other methods
             - Facilitates use of certain C6000 instructions not easily
               represented in C
             - Optimizes data flow bandwidth (uses word access for short
               (’C62x, ’C64x, and ’C67x) data, and double word access for
               word (’C64x and ’C67x) data)

3      4a    Write linear assembly
             - Allows control in determining exact ’C6000 instructions to
               be used
             - Provides flexibility of hand-coded assembly without worry of
               pipelining, parallelism, or register allocation
             - Can pass memory bank information to the tools
             - Uses .trip directive to convey loop count information

       4b    Add partitioning information to the linear assembly
             - Can improve partitioning of loops when necessary
             - Can avoid bottlenecks of certain hardware resources
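

The kind of information added in step 2 can be sketched in C as follows. This
is a minimal illustration rather than code from this manual: the function
name vec_add and the trip count values are hypothetical, the memory bank
pragma is omitted, and the sketch assumes that the ’C6000 compiler predefines
_TMS320C6X so that the same source also builds with a host compiler.

```c
/* _nassert is an intrinsic of the 'C6000 compiler. On any other compiler
   this fallback (an assumption made for this sketch) turns it into a no-op. */
#ifndef _TMS320C6X
#define _nassert(cond) ((void)0)
#endif

/* Hypothetical example: sum[i] = a[i] + b[i] for n elements.
   The restrict qualifiers state that the three arrays do not overlap. */
void vec_add(short *restrict sum, const short *restrict a,
             const short *restrict b, int n)
{
    int i;

    /* Alignment information: each array starts on a word boundary. */
    _nassert(((int)sum & 3) == 0);
    _nassert(((int)a & 3) == 0);
    _nassert(((int)b & 3) == 0);

    /* Loop count information: at least 8 trips, always a multiple of 2.
       The pragma is guarded so that host compilers never see it. */
#ifdef _TMS320C6X
    #pragma MUST_ITERATE(8, , 2)
#endif
    for (i = 0; i < n; i++)
        sum[i] = (short)(a[i] + b[i]);
}
```

The caller is responsible for keeping these promises: the arrays must really
be word aligned and non-overlapping, and n must really be a multiple of 2
that is at least 8.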
When you achieve the desired performance in your code, there is no need to
move to the next step. Each of the steps in the development flow involves
passing more information to the C6000 tools. Even at the final step,
development time is greatly reduced from that of hand-coding, and the
performance approaches the best that can be achieved by hand.


Internal benchmarking efforts at Texas Instruments have shown that most
loops achieve maximal throughput after steps 1 and 2. For loops that do not,
the C/C++ compiler offers a rich set of optimizations that let you fine-tune
the code entirely from the high-level C language. For the few loops that
need even further optimizations, the assembly optimizer gives the programmer
more flexibility than C/C++ can offer, works within the framework of C/C++,
and is much like programming in higher-level C. For more information on the
assembly optimizer, see the TMS320C6000 Optimizing C/C++ Compiler User’s
Guide and Chapter 5, Optimizing Assembly Code via Linear Assembly, in this
book.


To aid the development process, some feedback is enabled by default in the
code generation tools. Example 1-1 shows output from the compiler and/or
assembly optimizer for a particular loop. The -mw feedback option generates
additional information not shown in Example 1-1, such as a single iteration
view of the loop.
Example 1-1. Compiler and/or Assembly Optimizer Feedback


;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Known Minimum Trip Count
;*      Known Maximum Trip Count
;*      Known Max Trip Count Factor
;*      Loop Carried Dependency Bound(^)
;*      Unpartitioned Resource Bound     : 4
;*      Partitioned Resource Bound(*)    : 5
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     2        3
;*      .S units                     4        4
;*      .D units                     1
;*      .M units
;*      .X cross paths
;*      .T address paths
;*      Long read paths
;*      Long write paths
;*      Logical  ops (.LS)           0        1     (.L or .S unit)
;*      Addition ops (.LSD)          6              (.L or .S or .D unit)
;*      Bound(.L .S .LS)             3        4
;*      Bound(.L .S .D .LS .LSD)     5*
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 5  Register is live too long
;*         ii = 6  Did not find schedule
;*         ii = 7  Schedule found with 3 iterations in parallel
;*      done
;*
;*      Epilog not entirely removed
;*      Collapsed epilog stages     : 1
;*
;*      Prolog not removed
;*      Collapsed prolog stages     : 0
;*
;*      Minimum required memory pad : 2 bytes
;*
;*      Minimum safe trip count     : 2


This feedback is important in determining which optimizations might be useful
for further improved performance. Section 2.1, Understanding Feedback, is
provided as a quick reference to techniques that can be used to optimize
loops and refers to specific sections within this book for more detail.
Optimizing C/C++ Code


You can maximize C/C++ performance by using compiler options, intrinsics,
and code transformations. This chapter discusses the following topics:


- The compiler and its options
- Intrinsics
- Software pipelining
- Loop unrolling


Topic

2.1  Understanding Feedback
2.2  Writing C/C++ Code
2.3  Compiling C/C++ Code
2.4  Profiling Your Code
2.5  Refining C/C++ Code
2.1 Understanding Feedback


The compiler provides some feedback by default. Additional feedback is
generated with the -mw option. The feedback is located in the .asm file that
the compiler generates. To view the feedback, you must also enable the -k
option, which retains the .asm output from the compiler. By understanding
feedback, you can quickly tune your C code to obtain the highest possible
performance.


The feedback in Example 1-1 is for an innermost loop. On the ’C6000, C code
loop performance is greatly affected by how well the compiler can software
pipeline. The feedback explains what issues the compiler encountered while
pipelining the loop and what results it obtained. This section covers all
the components of the software pipelining feedback window.


The compiler goes through three basic stages when compiling a loop. This
section focuses on understanding these stages and the feedback they produce.
This, combined with the Feedback Solutions in Appendix A, will send you well
on your way to fully optimizing your code with the C6000 compiler. The three
stages are:


1) Qualify the loop for software pipelining 
2) Collect loop resource and dependency graph information 


3) Software pipeline the loop 


2.1.1 Stage 1: Qualify the Loop for Software Pipelining


The result of this stage will show up as the first three or four lines in the
feedback window as long as the compiler qualifies the loop for pipelining:


Example 2-1. Stage 1 Feedback


;*      Known Minimum Trip Count
;*      Known Maximum Trip Count
;*      Known Max Trip Count Factor


- Trip Count. The number of iterations or trips through a loop.


- Minimum Trip Count. The minimum number of times the loop might execute
  given the amount of information available to the compiler.


- Maximum Trip Count. The maximum number of times the loop might execute
  given the amount of information available to the compiler.


- Maximum Trip Count Factor. The maximum number that will divide evenly
  into the trip count. Even though the exact value of the trip count is not
  deterministic, it may be known that the value is a multiple of 2, 4, etc.,
  which allows more aggressive packed data and unrolling optimization.


The compiler tries to identify the loop counter (called the trip counter,
after the number of trips through a loop) and any information about it, such
as its minimum value (known minimum trip count) and whether it is a multiple
of something (has a known maximum trip count factor).


If factor information is known about a loop counter, the compiler can be more 
aggressive with performing packed data processing and loop unrolling opti- 
mizations. For example, if the exact value of a loop counter is not known but 
it is known that the value is a multiple of some number, the compiler may be 
able to unroll the loop to improve performance. 
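

To see why factor information helps, consider a loop whose trip count is
known to be a multiple of 2. The sketch below, with hypothetical function
names, shows the original loop next to the unrolled form that such knowledge
permits. It illustrates the transformation by hand; it is not output from the
compiler.

```c
/* Original loop: one element per iteration. */
int sum_rolled(const short *x, int n)
{
    int i;
    int sum = 0;

    for (i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Unrolled by 2. Correct only when n is a multiple of 2, which is exactly
   the guarantee that a known max trip count factor of 2 provides. */
int sum_unrolled2(const short *x, int n)
{
    int i;
    int sum0 = 0;
    int sum1 = 0;

    for (i = 0; i < n; i += 2) {
        sum0 += x[i];       /* even elements */
        sum1 += x[i + 1];   /* odd elements  */
    }
    return sum0 + sum1;
}
```

Without the factor guarantee, the unrolled version would need extra code to
handle a leftover element when n is odd.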


There are several conditions that must be met before software pipelining is al- 
lowed, or legal, from the compiler’s point of view. These conditions are: 


- The loop cannot have too many instructions. Loops that are too big
  typically require more registers than are available and require a longer
  compilation time.


- The loop cannot call another function unless the called function is
  inlined. Any break in control flow makes it impossible to software
  pipeline, as multiple iterations are executing in parallel.


If any of the conditions for software pipelining are not met, qualification
of the pipeline will halt and a disqualification message will appear. For
more information about what disqualifies a loop from being
software-pipelined, see section 2.5.3.6.
2.1.2 Stage 2: Collect Loop Resource and Dependency Graph Information


The second stage of software pipelining a loop is collecting loop resource and
dependency graph information. The results of stage 2 will be displayed in the
feedback window as follows:


Example 2-2. Stage 2 Feedback


;*      Loop Carried Dependency Bound(^)
;*      Unpartitioned Resource Bound     : 4
;*      Partitioned Resource Bound(*)    : 5
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     2        3
;*      .S units                     4        4
;*      .D units                     1
;*      .M units
;*      .X cross paths
;*      .T address paths
;*      Long read paths
;*      Long write paths
;*      Logical  ops (.LS)           0        1     (.L or .S unit)
;*      Addition ops (.LSD)          6              (.L or .S or .D unit)
;*      Bound(.L .S .LS)             3        4
;*      Bound(.L .S .D .LS .LSD)     5*


- Loop carried dependency bound. The distance of the largest loop carry
  path, if one exists. A loop carry path occurs when one iteration of a loop
  writes a value that must be read in a future iteration. Instructions that
  are part of the loop carry bound are marked with the ^ symbol in the
  assembly code saved with the -k option in the *.asm file. The number shown
  for the loop carried dependency bound is the minimum iteration interval
  due to a loop carry dependency bound for the loop.


  Often, this loop carried dependency bound is due to lack of knowledge by
  the compiler about certain pointer variables. When exact values of
  pointers are not known, the compiler must assume that any two pointers
  might point to the same location. Thus, loads from one pointer have an
  implied dependency to another pointer performing a store and vice versa.
  This can create large (and usually unnecessary) dependency paths. When the
  Loop Carried Dependency Bound is larger than the Resource Bound, this is
  often the culprit. Potential solutions for this are shown in Appendix A,
  Feedback Solutions.
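

The effect of pointer aliasing can be seen in a small C sketch. The function
names are hypothetical. In the first version the compiler must assume that a
store through out may change a value that a later iteration loads through
in, which creates a loop carry path. In the second version the restrict
qualifier states that the two arrays do not overlap.

```c
/* Without restrict: out[i] might be the same location as in[i + 1], so each
   store has to be treated as a possible input to the next iteration. */
void scale_may_alias(short *out, const short *in, int n)
{
    int i;

    for (i = 0; i < n; i++)
        out[i] = (short)(in[i] * 2);
}

/* With restrict: the caller promises that out and in do not overlap, so
   loads from later iterations may be scheduled ahead of earlier stores. */
void scale_no_alias(short *restrict out, const short *restrict in, int n)
{
    int i;

    for (i = 0; i < n; i++)
        out[i] = (short)(in[i] * 2);
}
```

A call such as scale_may_alias(buf + 1, buf, n) is legal, and each stored
value really does feed the next iteration. Passing overlapping arrays to
scale_no_alias breaks the promise made by restrict and gives undefined
results.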


- Unpartitioned resource bound across all resources. The best-case resource
  bound minimum iteration interval (mii) before the compiler has partitioned
  each instruction to the A or B side. In Example 2-2, the unpartitioned
  resource bound is 4 because the .S units are required for 8 cycles, and
  there are 2 .S units.


- Partitioned resource bound across all resources. The mii after the
  instructions are partitioned to the A and B sides. In Example 2-2, after
  partitioning, we can see that the A side .L, .S, and .D units are required
  for a total of 13 cycles, making the partitioned resource bound
  ceil(13/3) = 5. For more information, see the description of
  Bound (.L .S .D .LS .LSD) later in this section.


- Resource partition table. Summarizes how the instructions have been
  assigned to the various machine resources and how they have been
  partitioned between the A and B side. An asterisk marks the entries that
  determine the resource bound value, in other words the maximum mii.
  Because the resources on the C6000 architecture are fairly orthogonal,
  many instructions can execute on 2 or more different functional units. For
  this reason, the table breaks these functional units down by the possible
  resource combinations. The table entries are described below:


    - Individual Functional Units (.L .S .D .M) show the total number of
      instructions that specifically require the .L, .S, .D, or .M functional
      units. Instructions that can operate on multiple different functional
      units are not included in these counts. They are described below in
      the Logical Ops (.LS) and Addition Ops (.LSD) rows.


    - .X cross paths represents the total number of A-to-B and B-to-A cross
      path accesses. When this particular row contains an asterisk, it has a
      resource bottleneck and partitioning may be a problem.


    - .T address paths represents the total number of address paths required
      by the loads and stores in the loop. This is actually different from
      the number of .D units needed, as some other instructions may use the
      .D unit. In addition, there can be cases where the number of .T address
      paths on a particular side might be higher than the number of .D units
      if .D units are partitioned evenly between A and B and .T address
      paths are not.


    - Long read path represents the total number of long read port paths.
      All long operations with long sources use this port to do extended
      width (40-bit) reads. Store operations share this port so they also
      count toward this total.


    - Long write path represents the total number of long write port paths.
      All instructions with long (40-bit) results will be counted in this
      number.
    - Logical ops (.LS) represents the total number of instructions that can
      use either the .L or .S unit.


    - Addition ops (.LSD) represents the total number of instructions that
      can use either the .L or .S or .D unit.


    - Bound (.L .S .LS) represents the resource bound value as determined by
      the number of instructions that use the .L and .S units. It is
      calculated with the following formula:

          Bound(.L .S .LS) = ceil((.L + .S + .LS) / 2)

      Where ceil represents the ceiling function. This means you always
      round up to the nearest integer. In Example 2-2, the B side needs:

          3 .L unit only instructions
          4 .S unit only instructions
          1 logical .LS instruction

      You would need at least ceil(8/2) = 4 cycles to issue these.


    - Bound (.L .S .D .LS .LSD) represents the resource bound value as
      determined by the number of instructions that use the .D, .L, and .S
      units. It is calculated with the following formula:

          Bound(.L .S .D .LS .LSD) = ceil((.L + .S + .D + .LS + .LSD) / 3)

      Where ceil represents the ceiling function. This means you always
      round up to the nearest integer. In Example 2-2, the A side needs:

          2 .L unit only instructions
          4 .S unit only instructions
          1 .D unit only instruction
          0 logical .LS instructions
          6 addition .LSD instructions

      You would need at least ceil(13/3) = 5 cycles to issue these.
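

The two bound formulas can be checked with a few lines of C. This is only a
sketch of the arithmetic; the function names are hypothetical, and the
compiler itself reports these values directly in the feedback.

```c
/* Integer ceiling of a / b for non-negative a and positive b. */
static int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

/* Bound(.L .S .LS) for one side: instructions that must share the
   two units .L and .S. */
int bound_l_s_ls(int l, int s, int ls)
{
    return ceil_div(l + s + ls, 2);
}

/* Bound(.L .S .D .LS .LSD) for one side: instructions that must share the
   three units .L, .S, and .D. */
int bound_l_s_d_ls_lsd(int l, int s, int d, int ls, int lsd)
{
    return ceil_div(l + s + d + ls + lsd, 3);
}
```

With the counts quoted above, bound_l_s_ls(3, 4, 1) gives 4 for the B side
and bound_l_s_d_ls_lsd(2, 4, 1, 0, 6) gives 5 for the A side.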


2.1.3 Stage 3: Software Pipeline the Loop 


Once the compiler has completed qualification of the loop, partitioned it, and 
analyzed the necessary loop carry and resource requirements, it can begin to 
attempt software pipelining. This section will focus on the following lines from 
the feedback example: 


Example 2-3. Stage 3 Feedback


;*      Searching for software pipeline schedule at ...
;*         ii = 5  Register is live too long
;*         ii = 6  Did not find schedule
;*         ii = 7  Schedule found with 3 iterations in parallel
;*      done
;*
;*      Epilog not entirely removed
;*      Collapsed epilog stages     : 1
;*
;*      Prolog not removed
;*      Collapsed prolog stages     : 0
;*
;*      Minimum required memory pad : 2 bytes
;*
;*      Minimum safe trip count     : 2


- Iteration interval (ii). The number of cycles between the initiation of
  successive iterations of the loop. The smaller the iteration interval, the
  fewer cycles it takes to execute a loop. All of the numbers shown in each
  row of the feedback imply something about what the minimum iteration
  interval (mii) will be for the compiler to attempt initial software
  pipelining.


  Several things determine the mii of the loop; they are described in the
  following sections. The mii is simply the maximum of these individual
  mii’s.


The first thing the compiler attempts during this stage is to schedule the
loop at an iteration interval (ii) equal to the mii determined in stage 2:
collect loop resource and dependency graph information. In the example above,
since the A-side bound (.L, .S, .D, .LS, and .LSD) was the mii bottleneck,
our example starts with:


      Searching for software pipeline schedule at ...
         ii = 5  Register is live too long


If the attempt was not successful, the compiler provides additional feedback
to help explain why. In this case, the compiler cannot find a schedule at
5 cycles because a register is live too long. For more information about
live too long issues, see section 5.10.
Sometimes the compiler finds a valid software pipeline schedule but one or
more of the values is live too long. The lifetime of a register is determined
by the cycle in which a value is written into it and by the last cycle in
which this value is read by another instruction. By definition, a variable
can never be live longer than the ii of the loop, because the next iteration
of the loop will overwrite that value before it is read.


The compiler then proceeds to: 
ii = 6 Did not find schedule 


Sometimes, due to a complex loop or schedule, the compiler simply cannot 
find a valid software pipeline schedule at a particular iteration interval. 


Regs Live Always : 1/5 (A/B-side) 
Max Regs Live : 14/19 
Max Cond Regs Live : 1/0 


- Regs Live Always refers to the number of registers needed for variables
  to be live every cycle in the loop. Data loaded into registers outside the
  loop and read inside the loop will fall into this category.


- Max Regs Live refers to the maximum number of variables live on any one
  cycle in the loop. If there are 33 variables live on one of the cycles
  inside the loop, a minimum of 33 registers is necessary and this will not
  be possible with the 32 registers available on the ’C62x and ’C67x cores.
  In addition, this is broken down between A and B side, so if there is
  uneven partitioning with 30 values and there are 17 on one side and 13 on
  the other, the same problem will exist. This situation does not apply to
  the 64 registers available on the ’C64x core.


- Max Cond Regs Live tells us if there are too many conditional values
  needed on a given cycle. The ’C62x and ’C67x cores have 2 A side and
  3 B side condition registers available. The ’C64x core has 3 A side and 3
  B side condition registers available.
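

The register pressure check described for Max Regs Live amounts to a per-side
comparison. The sketch below uses hypothetical names and assumes the register
file is split evenly between the A and B sides (16 per side on the ’C62x and
’C67x, 32 per side on the ’C64x). It ignores any registers the compiler may
reserve for other purposes.

```c
/* Registers per side, assuming an even A/B split of the register file. */
enum {
    REGS_PER_SIDE_C62X = 16,   /* also applies to the 'C67x */
    REGS_PER_SIDE_C64X = 32
};

/* Returns 1 if the Max Regs Live figures fit on both sides, 0 otherwise. */
int regs_fit(int max_live_a, int max_live_b, int regs_per_side)
{
    return max_live_a <= regs_per_side && max_live_b <= regs_per_side;
}
```

For the uneven partitioning mentioned above, regs_fit(17, 13,
REGS_PER_SIDE_C62X) returns 0 even though only 30 values are live in total.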


After failing at ii = 6, the compiler proceeds to ii = 7: 


ii = 7 Schedule found with 3 iterations in parallel 


It is successful and finds a valid schedule with 3 iterations in parallel. This 
means it is pipelined 3 deep. In other words, before iteration n has completed, 
iterations n+1 and n+2 have begun. 


Each time a particular iteration interval fails, the ii is increased and retried. This 
continues until the ii is equal to the length of a list scheduled loop (no software 
pipelining). This example shows two possible reasons that a loop was not soft- 
ware pipelined. To view the full detail of all possible messages and their de- 
scriptions, see Feedback Solutions in Appendix A. 
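

The retry behavior described above can be summarized as a simple search. The
following sketch is not the compiler's implementation: try_schedule_fn and
example_try_schedule are hypothetical stand-ins for the scheduler, and the
stand-in simply reproduces the outcome of this section's example, where the
first successful ii is 7.

```c
/* Hypothetical scheduler hook: returns 1 if a software pipeline schedule
   exists at the given ii, 0 otherwise. */
typedef int (*try_schedule_fn)(int ii);

/* Stand-in that mimics the example in this section: ii = 5 and ii = 6
   fail, ii = 7 succeeds. */
int example_try_schedule(int ii)
{
    return ii >= 7;
}

/* Starts at the mii (the larger of the two stage 2 bounds) and raises ii by
   one after each failure. Returns the first ii that schedules, or 0 if ii
   reaches the length of the list scheduled loop without success. */
int find_ii(int loop_carried_bound, int partitioned_resource_bound,
            int list_scheduled_length, try_schedule_fn try_schedule)
{
    int ii = loop_carried_bound > partitioned_resource_bound
                 ? loop_carried_bound
                 : partitioned_resource_bound;

    for (; ii < list_scheduled_length; ii++) {
        if (try_schedule(ii))
            return ii;
    }
    return 0;   /* the loop is not software pipelined */
}
```

A smaller ii is always tried first because it means fewer cycles per
iteration; the search only gives up when software pipelining would no longer
beat the list scheduled loop.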


After a successful schedule is found at a particular iteration interval, more in- 
formation about the loop is displayed. This information may relate to the load 
threshold, epilog/prolog collapsing, and projected memory bank conflicts. 


Speculative Load Threshold : 12 


When an epilog is removed, the loop is run extra times to finish out the last
iterations, or pipe down the loop. In doing so, extra loads from new
iterations of the loop will speculatively execute (even though their results
will never be used). To ensure that these memory accesses are not pointing to
invalid memory locations, the Load Threshold value tells you how many extra
bytes of data beyond your input arrays must be valid memory locations (not
memory-mapped I/O, for example) to ensure correct execution. In the large
address space of the ’C6000 this is not usually an issue, but you should be
aware of it.


Epilog not entirely removed 
Collapsed epilog stages : 1 


This refers to the number of epilog stages, or loop iterations, that were
removed. This can produce a large savings in code size. The -mh option
enables speculative execution and improves the compiler’s ability to remove
epilogs and prologs. However, in some cases epilogs and prologs can be
partially or entirely removed without speculative execution. Thus, you may
see nonzero values for this even without the -mh option.


Prolog not removed 
Collapsed prolog stages : 0 


This means that the prolog was not removed. For various technical reasons, 
prolog and epilog stages may not be partially or entirely removed. 

Minimum required memory pad : 2 bytes


The minimum required memory padding to use -mh is 2 bytes. See the
TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more information on
the -mh option and the minimum required memory padding.
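

One way to satisfy a minimum required memory pad is to make the input buffer
itself larger than the data it holds. The sketch below assumes the 2-byte pad
reported above; the names N, PAD_BYTES, PAD_ELEMS, and input are hypothetical.
Placing other valid data memory directly after the array would serve equally
well.

```c
#define N          64   /* number of input samples (hypothetical)          */
#define PAD_BYTES  2    /* "Minimum required memory pad" from the feedback */

/* Round the pad up to whole elements so that speculative loads just past
   input[N - 1] still touch memory that belongs to this array. */
#define PAD_ELEMS  ((PAD_BYTES + sizeof(short) - 1) / sizeof(short))

short input[N + PAD_ELEMS];
```

Only the first N elements carry data. The padding elements are never used by
the algorithm; they exist so that the extra loads read valid memory.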


Minimum safe trip count : 2


This means that the loop must execute at least twice to safely use the
software pipelined version of the loop. If the known minimum trip count is
less than this value, two versions of the loop will be generated. For more
information on eliminating redundant loops, see section 2.5.3.2.
2.2 Writing C/C++ Code


This chapter shows you how to analyze and tailor your code to be sure you are
getting the best performance from the ’C6000 architecture.


2.2.1 Tips on Data Types


Give careful consideration to the data type size when writing your code. The
’C6000 compiler defines a size for each data type (signed and unsigned):


- char      8 bits
- short     16 bits
- int       32 bits
- long      40 bits
- float     32 bits
- double    64 bits

Based on the size of each data type, follow these guidelines when writing C
code:


- Avoid code that assumes that int and long types are the same size, because
  the ’C6000 compiler uses long values for 40-bit operations. This can cause
  extra instructions to be generated and limit functional unit selection.


- Use the short data type for fixed-point multiplication inputs whenever
  possible because this data type provides the most efficient use of the 16-bit
  multiplier in the ’C6000 (1 cycle for “short * short” versus 5 cycles for
  “int * int”).


- Use int or unsigned int types for loop counters, rather than short or
  unsigned short data types, to avoid unnecessary sign-extension instructions.


- When using floating-point instructions on a floating-point device such as
  the ’C6700, use the -mv6700 compiler switch so the code generated will
  use the device’s floating-point hardware instead of performing the task
  with fixed-point hardware. For example, if the -mv6700 option is not used,
  the RTS floating-point multiply will be used instead of the MPYSP
  instruction.


- When using the ’C6400 device, use the -mv6400 compiler switch so the
  code generated will use the device’s additional hardware and instructions.


2.2.2 Analyzing C Code Performance 


Use the following techniques to analyze the performance of specific code
regions:


- One of the preliminary measures of code is the time it takes the code to
  run. Use the clock( ) and printf( ) functions in C/C++ to time and display
  the performance of specific code regions. You can use the stand-alone
  simulator (load6x) to run the code for this purpose. Remember to subtract
  out the overhead of calling the clock( ) function.


- Use the profile mode of the stand-alone simulator. This can be done by
  compiling your code with the -mg option and executing load6x with the -g
  option. The profile results will be stored in a file with the .vaa extension.
  Refer to the TMS320C6000 Optimizing C/C++ Compiler User’s Guide for
  more information.


- Enable the clock and use profile points and the RUN command in the Code
  Composer debugger to track the number of CPU clock cycles consumed
  by a particular section of code. Use “View Statistics” to view the number
  of cycles consumed.


- The critical performance areas in your code are most often loops. The
  easiest way to optimize a loop is by extracting it into a separate file that
  can be rewritten, recompiled, and run with the stand-alone simulator
  (load6x).


As you use the techniques described in this chapter to optimize your C/C++ 
code, you can then evaluate the performance results by running the code and 
looking at the instructions generated by the compiler. 
2.3 Compiling C/C++ Code


The ‘C6000 compiler offers high-level language support by transforming your 
C/C++ code into more efficient assembly language source code. The compiler 
tools include a shell program (cl6x), which you use to compile, assembly opti- 
mize, assemble, and link programs in a single step. To invoke the compiler 
shell, enter: 


cl6x [options] [filenames] [-z [linker options] [object files]] 


For a complete description of the C/C++ compiler and the options discussed 
in this chapter, see the TMS320C6000 Optimizing C/C++ Compiler User’s 
Guide (SPRU187). 


2.3.1 Compiler Options


Options control the operation of the compiler. This section introduces you to 
the recommended options for performance, optimization, and code size. Con- 
siderations of optimization versus performance are also discussed. 


The options described in Table 2—1 are obsolete or intended for debugging, 
and could potentially decrease performance and increase code size. Avoid us- 
ing these options with performance critical code. 


Table 2-1. Compiler Options to Avoid on Performance Critical Code


Option           Description

-g/-s/-ss/-gp    These options limit the amount of optimization across C
                 statements, leading to larger code size and slower execution.

-mu              Disables software pipelining for debugging. Use -ms2/-ms3
                 instead to reduce code size, which will disable software
                 pipelining among other code size optimizations.

-o1/-o0          Always use -o2/-o3 to maximize compiler analysis and
                 optimization. Use code size flags (-msn) to trade off between
                 performance and code size.

-mz              Obsolete. On pre-3.00 tools, this option may have improved
                 your code, but with 3.00+ compilers, this option will decrease
                 performance and increase code size.


The options in Table 2—2 can improve performance but require certain charac- 
teristics to be true, and are described below. 


Table 2-2. Compiler Options for Performance


Option           Description

-o3†             Represents the highest level of optimization available.
                 Various loop optimizations are performed, such as software
                 pipelining, unrolling, and SIMD. Various file level
                 characteristics are also used to improve performance.

-oi0             Disables all automatic size-controlled inlining (which is
                 enabled by -o3). User-specified inlining of functions is
                 still allowed.

-pm‡             Combines source files to perform program-level optimization
                 by allowing visibility to the entire application source.

† Although -o3 is preferable, at a minimum use the -o option.
‡ Use the -pm option for as much of your program as possible.


Table 2-2. Compiler Options for Performance (Continued)


Option           Description

-mh<n>           Allows speculative execution. The appropriate amount of
                 padding must be available in data memory to ensure correct
                 execution. This is normally not a problem but must be
                 adhered to.

-mi<n>           Describes the interrupt threshold to the compiler. If you
                 know that NO interrupts will occur in your code, the compiler
                 can avoid enabling and disabling interrupts before and after
                 software pipelined loops for a code size and performance
                 improvement. In addition, there is potential for performance
                 improvement where interrupt registers may be utilized in high
                 register pressure loops.

-mt              Enables the compiler to use assumptions that allow it to be
                 more aggressive with certain optimizations. When used on
                 linear assembly files, it acts like a .no_mdep directive that
                 has been defined for those linear assembly files.

-op2             Specifies that the module contains no functions or variables
                 that are called or modified from outside the source code
                 provided to the compiler. This improves variable analysis and
                 allowed assumptions.


Table 2-3. Compiler Options That Slightly Degrade Performance and Improve Code Size


Option           Description

-ms0/-ms1        Optimizes primarily for performance, and secondly for code
                 size. Could be used on all but the most performance critical
                 routines.


The options described in Table 2-4 are recommended for control code, and 
will result in smaller code size with minimal performance degradation. 


Table 2-4. Compiler Options for Control Code


Option           Description

-o3†             In addition to the optimizations described in Table 2-2, -o3
                 can perform other code size reducing optimizations like:
                 eliminating unused assignments, eliminating local and global
                 common subexpressions, and removing functions that are never
                 called.

-pm‡             Combines source files to perform program-level optimization
                 by allowing visibility to the entire application source.

-op2             Specifies that the module contains no functions or variables
                 that are called or modified from outside the source code
                 provided to the compiler. This improves variable analysis and
                 allowed assumptions.

-oi0             Disables all automatic size-controlled inlining (which is
                 enabled by -o3). User-specified inlining of functions is
                 still allowed.

-ms2/-ms3        Optimizes primarily for code size, and secondly for
                 performance.

† Although -o3 is preferable, at a minimum use the -o option.
‡ Use the -pm option for as much of your program as possible.


The options described in Table 2—5 provide information, but do not affect per- 
formance or code size. 


Table 2-5. Compiler Options for Information


Option           Description

-mw              Use this option to produce additional compiler feedback. This
                 option has no performance or code size impact.

-k               Keeps the assembly file so that you can inspect and analyze
                 compiler feedback. This option has no performance or code
                 size impact.

-mg              Enables automatic function level profiling with the loader.
                 Can result in minor performance degradation around function
                 call boundaries only.

-s/-ss           Interlists C/C++ source or optimizer comments in assembly.
                 The -s option may show minor performance degradation. The
                 -ss option may show more severe performance degradation.
2.3.2 Memory Dependencies


To maximize the efficiency of your code, the ’C6000 compiler schedules as
many instructions as possible in parallel. To schedule instructions in parallel,
the compiler must determine the relationships, or dependencies, between
instructions. A dependency means that one instruction must occur before
another; for example, a variable must be loaded from memory before it can be
used. Because only independent instructions can execute in parallel,
dependencies inhibit parallelism.


- If the compiler cannot determine that two instructions are independent (for
  example, b does not depend on a), it assumes a dependency and schedules
  the two instructions sequentially, accounting for any latencies needed
  to complete the first instruction.


- If the compiler can determine that two instructions are independent of one
  another, it can schedule them in parallel.


Often it is difficult for the compiler to determine if instructions that access 
memory are independent. The following techniques help the compiler deter- 
mine which instructions are independent: 


- Use the restrict keyword to indicate that a pointer is the only pointer that
  can point to a particular object in the scope in which the pointer is declared.


- Use the -pm (program-level optimization) option, which gives the compiler
  global access to the whole program or module and allows it to be more
  aggressive in ruling out dependencies.


- Use the -mt option, which allows the compiler to use assumptions that
  allow it to eliminate dependencies. Remember, using the -mt option on
  linear assembly code is equivalent to adding the .no_mdep directive to the
  linear assembly source file. Specific memory dependencies should be
  specified with the .mdep directive. For more information see section 4.4,
  Assembly Optimizer Directives, in the TMS320C6000 Optimizing C/C++
  Compiler User’s Guide.
To illustrate the concept of memory dependencies, it is helpful to look at the
algorithm code in a dependency graph. Example 2-4 shows the C code for a
basic vector sum. Figure 2-1 shows the dependency graph for this basic
vector sum. For more information about drawing a dependency graph, see section
5.3.4, on page 5-11.


Example 2-4. Basic Vector Sum


void vecsum(short *sum, short *in1, short *in2, unsigned int N)
{
    int i;

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}


[Figure 2-1: dependency graph for the basic vector sum. Two Load nodes feed an
"Add elements" node, which feeds a "Store to memory" node. Each path is labeled
with the number of cycles required to complete an instruction.]
The dependency graph in Figure 2-1 shows that:


- The paths from sum[i] back to in1[i] and in2[i] indicate that writing to sum
  may have an effect on the memory pointed to by either in1 or in2.


- A read from in1 or in2 cannot begin until the write to sum finishes, which
  creates an aliasing problem. Aliasing occurs when two pointers can point
  to the same memory location. For example, if vecsum( ) is called in a
  program with the following statements, in1 and sum alias each other because
  they both point to the same memory location:


short a[10], b[10]; 
vecsum(a, a, b, 10); 


2.3.2.1 The restrict Keyword


To help the compiler determine memory dependencies, you can qualify a
pointer, reference, or array with the restrict keyword. The restrict keyword is
a type qualifier whose use represents a guarantee by the programmer that, within
the scope of the pointer declaration, the object pointed to can be accessed only
by that pointer. Any violation of this guarantee renders the program undefined.
This practice helps the compiler optimize certain sections of code because
aliasing information can be more easily determined.


In the example that follows, you can use the restrict keyword to tell the compiler
that a and b never point to the same object in foo (and that the objects’ memory
that foo accesses does not overlap).


Example 2-5. Use of the Restrict Type Qualifier With Pointers 


void foo(int * restrict a, int * restrict b)
{
    /* foo's code here */
}


The next example uses the restrict keyword when passing arrays to a function.
Here, the arrays c and d should not overlap, nor should c and d point to the
same array.


Example 2-6. Use of the Restrict Type Qualifier With Arrays


void func1(int c[restrict], int d[restrict])
{
    int i;

    for (i = 0; i < 64; i++)
    {
        c[i] += d[i];
        d[i] += 1;
    }
}


Do not use the restrict keyword with code such as that listed in Example 2-7. By
using the restrict keyword in Example 2-7, you are telling the compiler that it
is legal to write to any location pointed to by a before reading the location
pointed to by b. This can cause an incorrect program because both a and b
point to the same object, array.


Example 2-7. Incorrect Use of the restrict Keyword


void func (short *a, short * restrict b)   /*Bad!! */
{
    int i;

    for (i = 11; i < 44; i++) *(--a) = *(--b);
}

void main ()
{
    short array[] = {  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
                      11, 12, 13, 14, 15, 16, 17, 18,
                      19, 20, 21, 22, 23, 24, 25, 26,
                      27, 28, 29, 30, 31, 32, 33, 34,
                      35, 36, 37, 38, 39, 40, 41, 42,
                      43, 44 };

    short *ptr1, *ptr2;

    ptr2 = array + 44;
    ptr1 = ptr2 - 11;

    func(ptr2, ptr1);   /*Bad!! */
}


Note: Do not use const to tell the compiler that two pointers do not point
to the same object. Use restrict instead.
2.3.2.2 The -mt Option


Another way to eliminate memory dependencies is to use the -mt option,
which allows the compiler to use assumptions that can eliminate memory
dependency paths. For example, if you use the -mt option when compiling the
code in Example 2-4, the compiler uses the assumption that in1 and in2
do not alias memory pointed to by sum and, therefore, eliminates memory
dependencies among the instructions that access those variables.


If your code does not follow the assumptions generated by the -mt option, you
can get incorrect results. For more information on the -mt option, refer to the
TMS320C6000 Optimizing C/C++ Compiler User’s Guide (SPRU187).
2.3.3 Performing Program-Level Optimization (-pm Option)


You can specify program-level optimization by using the -pm option with the
-o3 option. With program-level optimization, all your source files are compiled
into one intermediate file, giving the compiler a complete program view during
compilation. This creates a significant advantage for determining pointer
locations passed into a function. Once the compiler determines that two pointers
do not access the same memory location, substantial improvements can be made in
software pipelined loops. Because the compiler has access to the entire
program, it performs several additional optimizations rarely applied during
file-level optimization:


- If a particular argument in a function always has the same value, the
  compiler replaces the argument with the value and passes the value instead
  of the argument.


- If a return value of a function is never used, the compiler deletes the return
  code in the function.


- If a function is not called, directly or indirectly, the compiler removes the
  function.


Also, using the —pm option can lead to better schedules for your loops. If the 
number of iterations of a loop is determined by a value passed into the function, 
and the compiler can determine what that value is from the caller, then the 
compiler will have more information about the minimum trip count of the loop 
leading to a better resulting schedule. 
2.4 Profiling Your Code


In large applications, it makes sense to optimize the most important sections
of code first. You can use the information generated by profiling options to get
started. You can use several different methods to profile your code. These
methods are described below.


2.4.1 Using the Standalone Simulator (load6x) to Profile


There are two ways to use the standalone simulator (load6x) for profiling.


- If you are interested in just a profile of all of the functions in your
  application, you can use the -g option in load6x.


- If you are interested in just profiling the cycle count of one or two functions,
  or if you are interested in a region of code inside a particular function, you
  can use calls to the clock( ) function (supported by load6x) to time those
  particular functions or regions of code.


2.4.1.1 Using the —g Option to Profile on load6x 


Invoking load6x with the -g option runs the standalone simulator in profiling
mode. Source files must be compiled with the -mg profiling option for profiling
to work on the standalone simulator. The profile results are stored in a file
with the same name as the .out file, but with the .vaa extension.


If you used the -mg profiling option when compiling and linking "example.out",
use the -g option to create a file in which you can view the profiling results. For
example, enter the following on your command line:

load6x -g example.out


Now you can view the file "example.vaa" to see the results of the profile
session created with the -mg option on the .out file.


Your new file, "example.vaa", should have been created in the same directory
as the .out file. You can open the .vaa file with a text editor. You should see
something like this:


Program Name:   example.out
Start Address:  00007980 main, at line 1, "demo1.c"
Stop Address:   00007860 exit
Run Cycles:     3339
Profile Cycles: 3339
BP Hits:        11
*******************************************************************
Area Name          Count   Inclusive   Incl-Max   Exclusive   Excl-Max
CF iir1( )             1         236        236         236        236
CF vec_mpy1( )         1         248        248         248        248
CF mac1( )             1         168        168         168        168
CF main( )             1        3333       3333          40         40


Count represents the number of times each function was called and entered.
Inclusive represents the total cycle time spent inside that function, including
calls to other functions. Incl-Max (Inclusive Max) represents the longest time
spent inside that function during one call. Exclusive and Excl-Max are the
same as Inclusive and Incl-Max except that time spent in calls to other
functions inside that function has been removed.


2.4.1.2 Using the clock( ) Function to Profile


To get cycle count information for a function or region of code with the standa- 
lone simulator, embed the clock( ) function in your C code. The following exam- 
ple demonstrates how to include the clock() function in your C code. 


Example 2-8. Including the clock( ) Function


#include <stdio.h>
#include <time.h>   /* need time.h in order to call clock() */

/* mac1(), vec_mpy1(), and iir1() are the routines being timed;
   they are defined elsewhere. */

int main(int argc, char *argv[]) {
    const short coefs[150];
    short optr[150];
    short state[2];
    const short a[150];
    const short b[150];
    int c = 0;
    int dotp[1] = {0};
    int sum = 0;
    short y[150];
    short scalar = 3345;
    const short x[150];
    clock_t start, stop, overhead;

    start = clock();   /* Calculate overhead of calling clock */
    stop = clock();    /* and subtract this value from the results */
    overhead = stop - start;

    start = clock();
    mac1(a, b, c, dotp);
    stop = clock();
    printf("mac1 cycles: %d\n", (int)(stop - start - overhead));

    start = clock();
    vec_mpy1(y, x, scalar);
    stop = clock();
    printf("vec_mpy1 cycles: %d\n", (int)(stop - start - overhead));

    start = clock();
    iir1(coefs, x, optr, state);
    stop = clock();
    printf("iir1 cycles: %d\n", (int)(stop - start - overhead));

    return 0;
}
2.5 Refining C/C++ Code


You can realize substantial gains in the performance of your C/C++ code
by refining your code in the following areas:


- Using intrinsics to replace complicated C/C++ code


- Using word access to operate on 16-bit data stored in the high and low
  parts of a 32-bit register


- Using double access to operate on 32-bit data stored in a 64-bit register
  pair (’C64x and ’C67x only)


2.5.1 Using Intrinsics


The ’C6000 compiler provides intrinsics, special functions that map directly to 
inlined ’C62x/’C64x/'C67x instructions, to optimize your C/C++ code quickly. 
All instructions that are not easily expressed in C/C++ code are supported as 
intrinsics. Intrinsics are specified with a leading underscore (_) and are ac- 
cessed by calling them as you call a function. 


For example, saturated addition can be expressed in C/C++ code only by writ- 
ing a multicycle function, such as the one in Example 2-9. 


Example 2-9. Saturated Add Without Intrinsics


int sadd(int a, int b)
{
    int result;

    result = a + b;

    if (((a ^ b) & 0x80000000) == 0)
    {
        if ((result ^ a) & 0x80000000)
        {
            result = (a < 0) ? 0x80000000 : 0x7fffffff;
        }
    }

    return (result);
}


This complicated code can be replaced by the _sadd( ) intrinsic, which results 
in a single ’C6x instruction (see Example 2-10). 


Example 2-10. Saturated Add With Intrinsics 


result = _sadd(a,b); 
Table 2-6 lists the ’C6000 intrinsics. For more information on using intrinsics,
see the TMS320C6000 Optimizing C/C++ Compiler User’s Guide.


Table 2-6. TMS320C6000 C/C++ Compiler Intrinsics


                                                Assembly
C Compiler Intrinsic                            Instruction  Description                                 Device

int _abs(int src2);                             ABS          Returns the saturated absolute value of
int _labs(long src2);                                        src2.

int _abs2(int src2);                            ABS2         Calculates the absolute value for each      ’C64x
                                                             16-bit value.

int _add2(int src1, int src2);                  ADD2         Adds the upper and lower halves of src1 to
                                                             the upper and lower halves of src2 and
                                                             returns the result. Any overflow from the
                                                             lower half add will not affect the upper
                                                             half add.

int _add4(int src1, int src2);                  ADD4         Performs 2s-complement addition to pairs    ’C64x
                                                             of packed 8-bit numbers.

ushort & _amem2(void *ptr);                     LDHU/STH     Allows aligned loads and stores of 2 bytes
                                                             to memory.

uint & _amem4(void *ptr);                       LDW/STW      Allows aligned loads and stores of 4 bytes
                                                             to memory.

double & _amemd8(void *ptr);                    LDDW/STDW    Allows aligned loads and stores of 8 bytes  ’C64x
                                                or           to memory.                                  or
                                                2 LDW/2 STW                                              all

const ushort & _amem2_const(const void *ptr);   LDHU         Allows aligned loads of 2 bytes from
                                                             memory.

const uint & _amem4_const(const void *ptr);     LDW          Allows aligned loads of 4 bytes from
                                                             memory.

const double & _amemd8_const(const void *ptr);  LDDW         Allows aligned loads of 8 bytes from        ’C64x
                                                or           memory.                                     or
                                                2 LDW                                                    all

int _avg2(int src1, int src2);                  AVG2         Calculates the average for each pair of     ’C64x
                                                             signed 16-bit values.

unsigned _avgu4(uint src1, uint src2);          AVGU4        Calculates the average for each pair of     ’C64x
                                                             unsigned 8-bit values.


Note: Instructions not specified with a device apply to all ’C6000 devices.
Table 2-6. TMS320C6000 C/C++ Compiler Intrinsics (Continued)


                                                Assembly
C Compiler Intrinsic                            Instruction  Description                                 Device

unsigned _bitc4(uint src2);                     BITC4        For each of the 8-bit quantities in src2,   ’C64x
                                                             the number of 1 bits is written to the
                                                             corresponding position in the return value.

unsigned _bitr(uint src2);                      BITR         Reverses the order of the bits.             ’C64x

uint _clr(uint src2, uint csta, uint cstb);     CLR          Clears the specified field in src2. The
                                                             beginning and ending bits of the field to
                                                             be cleared are specified by csta and cstb,
                                                             respectively.

unsigned _clrr(uint src1, int src2);            CLR          Clears the specified field in src2. The
                                                             beginning and ending bits of the field to
                                                             be cleared are specified by the lower 10
                                                             bits of the source register.

int _cmpeq2(int src1, int src2);                CMPEQ2       Performs equality comparisons on each pair  ’C64x
                                                             of 16-bit values. Equality results are
                                                             packed into the two least-significant bits
                                                             of the return value.

int _cmpeq4(int src1, int src2);                CMPEQ4       Performs equality comparisons on each pair  ’C64x
                                                             of 8-bit values. Equality results are
                                                             packed into the four least-significant bits
                                                             of the return value.

int _cmpgt2(int src1, int src2);                CMPGT2       Compares each pair of signed 16-bit         ’C64x
                                                             values. Results are packed into the two
                                                             least-significant bits of the return value.

unsigned _cmpgtu4(uint src1, uint src2);        CMPGTU4      Compares each pair of unsigned 8-bit        ’C64x
                                                             values. Results are packed into the four
                                                             least-significant bits of the return value.

unsigned _deal(uint src2);                      DEAL         The odd and even bits of src2 are           ’C64x
                                                             extracted into two separate 16-bit values.

int _dotp2(int src1, int src2);                 DOTP2        The product of signed lower 16-bit values   ’C64x
double _ldotp2(int src1, int src2);             LDOTP2       of src1 and src2 is added to the product of
                                                             signed upper 16-bit values of src1 and
                                                             src2.

int _dotpn2(int src1, int src2);                DOTPN2       The product of signed lower 16-bit values   ’C64x
                                                             of src1 and src2 is subtracted from the
                                                             product of signed upper 16-bit values of
                                                             src1 and src2.


Note: Instructions not specified with a device apply to all ’C6000 devices.
Table 2-6. TMS320C6000 C/C++ Compiler Intrinsics (Continued)


                                                Assembly
C Compiler Intrinsic                            Instruction  Description                                 Device

int _dotpnrsu2(int src1, uint src2);            DOTPNRSU2    The product of unsigned lower 16-bit        ’C64x
                                                             values in src1 and src2 is subtracted from
                                                             the product of signed upper 16-bit values
                                                             in src1 and src2. 2^15 is added and the
                                                             result is sign shifted right by 16.

int _dotprsu2(int src1, uint src2);             DOTPRSU2     The product of the first signed pair of     ’C64x
                                                             16-bit values is added to the product of
                                                             the unsigned second pair of 16-bit values
                                                             in src1 and src2. 2^15 is added and the
                                                             result is sign shifted right by 16.

unsigned _dotpu4(uint src1, uint src2);         DOTPU4       For each pair of 8-bit values in src1 and   ’C64x
int _dotpsu4(int src1, uint src2);              DOTPSU4      src2, the 8-bit value from src1 is
                                                             multiplied with the 8-bit value from src2.
                                                             The four products are summed together.

int _dpint(double);                             DPINT        Converts 64-bit double to 32-bit signed     ’C67x
                                                             integer, using the rounding mode set by
                                                             the CSR register.

long _dtol(double src);                                      Reinterprets double register pair src as a
                                                             long register pair.

int _ext(int src2, uint csta, int cstb);        EXT          Extracts the specified field in src2,
                                                             sign-extended to 32 bits. The extract is
                                                             performed by a shift left followed by a
                                                             signed shift right; csta and cstb are the
                                                             shift left and shift right amounts,
                                                             respectively.

int _extr(int src2, int src1);                  EXT          Extracts the specified field in src2,
                                                             sign-extended to 32 bits. The extract is
                                                             performed by a shift left followed by a
                                                             signed shift right; csta and cstb are the
                                                             shift left and shift right amounts,
                                                             respectively.

uint _extu(uint src2, uint csta, uint cstb);    EXTU         Extracts the specified field in src2,
                                                             zero-extended to 32 bits. The extract is
                                                             performed by a shift left followed by an
                                                             unsigned shift right; csta and cstb are
                                                             the shift left and shift right amounts,
                                                             respectively.


Note: Instructions not specified with a device apply to all ’C6000 devices.
Table 2-6. TMS320C6000 C/C++ Compiler Intrinsics (Continued)


                                                Assembly
C Compiler Intrinsic                            Instruction  Description                                 Device

uint _extur(uint src2, int src1);               EXTU         Extracts the specified field in src2,
                                                             zero-extended to 32 bits. The extract is
                                                             performed by a shift left followed by an
                                                             unsigned shift right; csta and cstb are
                                                             the shift left and shift right amounts,
                                                             respectively.

uint _ftoi(float);                                           Reinterprets the bits in the float as an
                                                             unsigned integer.
                                                             (Ex: _ftoi(1.0) == 1065353216U)

int _gmpy4(int src1, int src2);                 GMPY4        Performs the Galois field multiply on 4     ’C64x
                                                             values in src1 with 4 parallel values in
                                                             src2. The 4 products are packed into the
                                                             return value.

uint _hi(double);                                            Returns the high 32 bits of a double as an
                                                             integer.

double _itod(uint, uint);                                    Creates a new double register pair from
                                                             two unsigned integers.

float _itof(uint);                                           Reinterprets the bits in the unsigned
                                                             integer as a float.
                                                             (Ex: _itof(0x3f800000) == 1.0)

double _ldotp2(int src1, int src2);             LDOTP2       The product of the lower signed 16-bit      ’C64x
int _dotp2(int src1, int src2);                 DOTP2        values in src1 and src2 is added to the
                                                             product of the upper signed 16-bit values
                                                             in src1 and src2. For _ldotp2, the result
                                                             is sign extended to 64 bits.

uint _lmbd(uint src1, uint src2);               LMBD         Searches for a leftmost 1 or 0 of src2
                                                             determined by the LSB of src1. Returns the
                                                             number of bits up to the bit change.

uint _lo(double);                                            Returns the low (even) register of a
                                                             double register pair as an integer.

double _ltod(long src);                                      Reinterprets long register pair src as a
                                                             double register pair.

long _dtol(double src);                                      Reinterprets double register pair src as a
                                                             long register pair.


Note: Instructions not specified with a device apply to all ’C6000 devices.
Table 2-6. TMS320C6000 C/C++ Compiler Intrinsics (Continued)


                                               Assembly
C Compiler Intrinsic                           Instruction  Description                                  Device
int _max2(int src1, int src2);                 MAX2         Places the larger/smaller of each pair of    ’C64x
int _min2(int src1, int src2);                 MIN2         values in the corresponding position in the
unsigned _maxu4(uint src1, uint src2);         MAXU4        return value. Values can be 16-bit signed
unsigned _minu4(uint src1, uint src2);         MINU4        or 8-bit unsigned.
ushort & _mem2(void *ptr);                     2 LDB/       Allows unaligned loads and stores of 2 bytes
                                               2 STB        to memory.
uint & _mem4(void *ptr);                       LDNW/        Allows unaligned loads and stores of 4 bytes ’C64x
                                               STNW         to memory.
double & _memd8(void *ptr);                    LDNDW/       Allows unaligned loads and stores of 8 bytes ’C64x
                                               STNDW        to memory.
const ushort & _mem2_const(const void *ptr);   2 LDB        Allows unaligned loads of 2 bytes from
                                                            memory.
const uint & _mem4_const(const void *ptr);     LDNW         Allows unaligned loads of 4 bytes from       ’C64x
                                                            memory.
const double & _memd8_const(const void *ptr);  LDNDW        Allows unaligned loads of 8 bytes from       ’C64x
                                                            memory.
double _mpy2(int src1, int src2);              MPY2         Returns the products of the lower and        ’C64x
                                                            higher 16-bit values in src1 and src2.
double _mpyhi(int src1, int src2);             MPYHI        Produces a 16 by 32 multiply. The result is  ’C64x
double _mpyli(int src1, int src2);             MPYLI        placed into the lower 48 bits of the returned
                                                            double. Can use the upper or lower 16 bits
                                                            of src1.
int _mpyhir(int src1, int src2);               MPYHIR       Produces a signed 16 by 32 multiply. The     ’C64x
int _mpylir(int src1, int src2);               MPYLIR       result is shifted right by 15 bits. Can use
                                                            the upper or lower 16 bits of src1.
double _mpysu4(int src1, uint src2);           MPYSU4       For each 8-bit quantity in src1 and src2,    ’C64x
double _mpyu4(uint src1, uint src2);           MPYU4        performs an 8-bit by 8-bit multiply. The
                                                            four 16-bit results are packed into a
                                                            double. The results can be signed or
                                                            unsigned.
int _mpy(int src1, int src2);                  MPY          Multiplies the 16 LSBs of src1 by the 16
int _mpyus(uint src1, int src2);               MPYUS        LSBs of src2 and returns the result. Values
int _mpysu(int src1, uint src2);               MPYSU        can be signed or unsigned.
uint _mpyu(uint src1, uint src2);              MPYU


int _mpyh(int src1, int src2);                 MPYH         Multiplies the 16 MSBs of src1 by the 16
int _mpyhus(uint src1, int src2);              MPYHUS       MSBs of src2 and returns the result. Values
int _mpyhsu(int src1, uint src2);              MPYHSU       can be signed or unsigned.
uint _mpyhu(uint src1, uint src2);             MPYHU
int _mpyhl(int src1, int src2);                MPYHL        Multiplies the 16 MSBs of src1 by the 16
int _mpyhuls(uint src1, int src2);             MPYHULS      LSBs of src2 and returns the result. Values
int _mpyhslu(int src1, uint src2);             MPYHSLU      can be signed or unsigned.
uint _mpyhlu(uint src1, uint src2);            MPYHLU
int _mpylh(int src1, int src2);                MPYLH        Multiplies the 16 LSBs of src1 by the 16
int _mpyluhs(uint src1, int src2);             MPYLUHS      MSBs of src2 and returns the result. Values
int _mpylshu(int src1, uint src2);             MPYLSHU      can be signed or unsigned.
uint _mpylhu(uint src1, uint src2);            MPYLHU
int _mvd(int src2);                            MVD          Moves the data from the src to the return    ’C64x
                                                            value over 4 cycles using the multiplier
                                                            pipeline.
void _nassert(int);                                         Generates no code. Tells the optimizer
                                                            that the expression declared with the
                                                            assert function is true. This gives a hint to
                                                            the compiler as to what optimizations
                                                            might be valid (word-wide optimizations).
uint _norm(int src2);                          NORM         Returns the number of bits up to the first
uint _lnorm(long src2);                                     nonredundant sign bit of src2.
unsigned _pack2(uint src1, uint src2);         PACK2        The lower/upper half-words of src1 and       ’C64x
unsigned _packh2(uint src1, uint src2);        PACKH2       src2 are placed in the return value.
unsigned _packh4(uint src1, uint src2);        PACKH4       Packs alternate bytes into return value.     ’C64x
unsigned _packl4(uint src1, uint src2);        PACKL4       Can pack high or low bytes.
unsigned _packhl2(uint src1, uint src2);       PACKHL2      The upper/lower half-word of src1 is         ’C64x
unsigned _packlh2(uint src1, uint src2);       PACKLH2      placed in the upper half-word of the return
                                                            value. The lower/upper half-word of src2
                                                            is placed in the lower half-word of the
                                                            return value.
double _rcpdp(double);                         RCPDP        Computes the approximate 64-bit double       ’C67x
                                                            reciprocal.
float _rcpsp(float);                           RCPSP        Computes the approximate 32-bit float        ’C67x
                                                            reciprocal.
unsigned _rotl(uint src2, uint src1);          ROTL         Rotates src2 to the left by the amount in    ’C64x
                                                            src1.
double _rsqrdp(double src);                    RSQRDP       Computes the approximate 64-bit double       ’C67x
                                                            reciprocal square root.
float _rsqrsp(float src);                      RSQRSP       Computes the approximate 32-bit float        ’C67x
                                                            reciprocal square root.
int _sadd(int src1, int src2);                 SADD         Adds src1 to src2 and saturates the result.
long _lsadd(int src1, long src2);                           Returns the result.
unsigned _saddu4(uint src1, uint src2);        SADDU4       Performs saturated addition between          ’C64x
                                                            pairs of 8-bit unsigned values in src1 and
                                                            src2.
int _sadd2(int src1, int src2);                SADD2        Performs saturated addition between          ’C64x
int _saddus2(uint src1, int src2);             SADDUS2      pairs of 16-bit values in src1 and src2.
                                                            src1 values can be signed or unsigned.
int _sat(long src2);                           SAT          Converts a 40-bit value to a 32-bit value
                                                            and saturates if necessary.
uint _set(uint src2, uint csta, uint cstb);    SET          Sets the specified field in src2 to all 1s and
                                                            returns the src2 value. The beginning and
                                                            ending bits of the field to be set are
                                                            specified by csta and cstb, respectively.
unsigned _setr(uint, int);                     SET          Sets the specified field in src2 to all 1s and
                                                            returns the src2 value. The beginning and
                                                            ending bits of the field to be set are
                                                            specified by the lower ten bits of the
                                                            source register.
unsigned _shfl(uint src2);                     SHFL         The lower 16 bits of src are placed in the   ’C64x
                                                            even bit positions, and the upper 16 bits of
                                                            src are placed in the odd bit positions.
unsigned _shlmb(uint src1, uint src2);         SHLMB        Shifts src2 left/right by one byte, and the  ’C64x
unsigned _shrmb(uint src1, uint src2);         SHRMB        most/least significant byte of src1 is


int _shr2(int src2, uint src1);                SHR2         For each 16-bit quantity in src2, the        ’C64x
unsigned _shru2(uint src2, uint src1);         SHRU2        quantity is arithmetically or logically
                                                            shifted right by src1 number of bits. src2
                                                            can contain signed or unsigned values.
int _smpy(int src1, int src2);                 SMPY         Multiplies src1 by src2, left shifts the
int _smpyh(int src1, int src2);                SMPYH        result by one, and returns the result. If
int _smpyhl(int src1, int src2);               SMPYHL       the result is 0x80000000, saturates the
int _smpylh(int src1, int src2);               SMPYLH       result to 0x7FFFFFFF.
double _smpy2(int src1, int src2);             SMPY2        Performs 16-bit multiplication between       ’C64x
                                                            pairs of signed packed 16-bit values, with
                                                            an additional 1 bit left-shift and saturate
                                                            into a double result.
int _spack2(int src1, int src2);               SPACK2       Two signed 32-bit values are saturated to    ’C64x
                                                            16-bit values and packed into the return
                                                            value.
unsigned _spacku4(int src1, int src2);         SPACKU4      Four signed 16-bit values are saturated to   ’C64x
                                                            8-bit values and packed into the return
                                                            value.
int _spint(float);                             SPINT        Converts 32-bit float to 32-bit signed       ’C67x
                                                            integer, using the rounding mode set by the
                                                            CSR register.
int _sshvl(int src2, int src1);                SSHVL        Shifts src2 to the left/right of src1 bits.  ’C64x
int _sshvr(int src2, int src1);                SSHVR        Saturates the result if the shifted value is
                                                            greater than MAX_INT or less than MIN_INT.
int _sshl(int src2, uint src1);                SSHL         Shifts src2 left by the contents of src1,
                                                            saturates the result to 32 bits, and returns
                                                            the result.
int _ssub(int src1, int src2);                 SSUB         Subtracts src2 from src1, saturates the
long _lssub(int src1, long src2);                           result size, and returns the result.
uint _subc(uint src1, uint src2);              SUBC         Conditional subtract divide step.



int _sub2(int src1, int src2);                 SUB2         Subtracts the upper and lower halves of
                                                            src2 from the upper and lower halves of
                                                            src1, and returns the result. Any borrowing
                                                            from the lower half subtract does not affect
                                                            the upper half subtract.
int _sub4(int src1, int src2);                 SUB4         Performs 2s-complement subtraction           ’C64x
                                                            between pairs of packed 8-bit values.
int _subabs4(int src1, int src2);              SUBABS4      Calculates the absolute value of the         ’C64x
                                                            differences for each pair of packed 8-bit
                                                            values.
uint _swap4(uint src2);                        SWAP4        Exchanges pairs of bytes (an endian          ’C64x
                                                            swap) within each 16-bit value.
uint _unpkhu4(uint src2);                      UNPKHU4      Unpacks the two high unsigned 8-bit          ’C64x
                                                            values into unsigned packed 16-bit values.
uint _unpklu4(uint src2);                      UNPKLU4      Unpacks the two low unsigned 8-bit           ’C64x
                                                            values into unsigned packed 16-bit values.
uint _xpnd2(uint src2);                        XPND2        Bits 1 and 0 of src are replicated to the    ’C64x
                                                            upper and lower halfwords of the result,
                                                            respectively.
uint _xpnd4(uint src2);                        XPND4        Bits 3 through 0 are replicated to bytes 3   ’C64x
                                                            through 0 of the result.


Note: Instructions not specified with a device apply to all ’C6000 devices.
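
The saturating entries in Table 2-6 can be easier to follow with a small reference model. The sketch below is plain host-side C, not TI code: sadd_model, smpy_model, and low16 are hypothetical names chosen for illustration, and the functions only mirror the table's descriptions of _sadd and _smpy so that the clamping behavior can be checked on any compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the 16 LSBs of v (hypothetical helper). */
static int32_t low16(int32_t v)
{
    uint32_t u = (uint32_t)v & 0xFFFFu;
    return (u & 0x8000u) ? (int32_t)u - 0x10000 : (int32_t)u;
}

/* Hypothetical model of _sadd: a 32-bit add that clamps to the
   int32 range instead of wrapping on overflow. */
static int32_t sadd_model(int32_t a, int32_t b)
{
    int64_t wide = (int64_t)a + (int64_t)b;   /* cannot overflow in 64 bits */
    if (wide > INT32_MAX) return INT32_MAX;
    if (wide < INT32_MIN) return INT32_MIN;
    return (int32_t)wide;
}

/* Hypothetical model of _smpy: multiply the signed 16 LSBs of each
   operand, shift left by one, and saturate the one case that would
   otherwise produce 0x80000000. */
static int32_t smpy_model(int32_t a, int32_t b)
{
    int64_t shifted = (int64_t)low16(a) * low16(b) * 2;
    assert(shifted >= INT32_MIN);             /* only positive overflow is possible */
    if (shifted > INT32_MAX) return INT32_MAX;
    return (int32_t)shifted;
}
```

In this model the only input pair that saturates smpy_model is 0x8000 times 0x8000, which matches the single special case named in the table.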
2.5.2 Wider Memory Access for Smaller Data Widths


In order to maximize data throughput on the C6000, it is often desirable to use
a single load or store instruction to access multiple data values consecutively
located in memory. For example, all C6000 devices have instructions with
corresponding intrinsics, such as _add2( ), _mpyhl( ), and _mpylh( ), that
operate on 16-bit data stored in the high and low parts of a 32-bit register. When
operating on a stream of 16-bit data, you can use word (32-bit) accesses to
read two 16-bit values at a time, and then use other ’C6x intrinsics to operate
on the data. Similarly, on the C64x and C67x devices, it is sometimes desirable
to perform 64-bit accesses with LDDW to access two 32-bit values, four
16-bit values, or even eight 8-bit values, depending on the situation.


For example, rewriting the vecsum( ) function to use word accesses (as in
Example 2-11) doubles the performance of the loop. See section 5.4, Using
Word Access for Short Data and Doubleword Access for Floating-Point Data,
on page 5-19 for more information. This type of optimization is called packed
data processing.
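
To make the packed-data idea concrete without a ’C6000 toolchain, the following is a minimal host-side sketch. It assumes nothing about the TI intrinsics beyond what the text states: add2_model is a hypothetical stand-in for _add2 (two independent 16-bit adds inside one 32-bit word), and vecsum_packed_model is a hypothetical name for a vector sum that moves two shorts per memory access.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of _add2: add the two 16-bit halves separately,
   so a carry out of the low half never reaches the high half. */
static uint32_t add2_model(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;   /* upper carry is dropped */
    return hi | lo;
}

/* Vector sum that handles two 16-bit elements per 32-bit access.
   memcpy stands in for the word load/store; n must be even. */
static void vecsum_packed_model(int16_t *sum, const int16_t *in1,
                                const int16_t *in2, unsigned n)
{
    unsigned i;
    assert((n & 1u) == 0);
    for (i = 0; i < n; i += 2) {
        uint32_t w1, w2, ws;
        memcpy(&w1, &in1[i], sizeof w1);
        memcpy(&w2, &in2[i], sizeof w2);
        ws = add2_model(w1, w2);
        memcpy(&sum[i], &ws, sizeof ws);
    }
}
```

Because both halves are treated identically, the result does not depend on which element lands in the high half, so the sketch behaves the same on little-endian and big-endian hosts.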


Example 2-11. Vector Sum With restrict Keywords, MUST_ITERATE, Word Reads 


void vecsum4(short *restrict sum, const short *restrict in1,
             const short *restrict in2, unsigned int N)
{
    int i;

    #pragma MUST_ITERATE (10);
    /* each word access covers two shorts: elements 2*i and 2*i+1 */
    for (i = 0; i < (N/2); i++)
        _amem4(&sum[2*i]) = _add2(_amem4_const(&in1[2*i]), _amem4_const(&in2[2*i]));
}


Note:


The MUST_ITERATE pragma tells the compiler that the following loop will
iterate at least the specified number of times.


The _amem4 intrinsic tells the compiler that the following access is a 4-byte
(or word) aligned access of an unsigned int beginning at address sum. The
_amem4_const intrinsics tell the compiler that the following accesses are
4-byte (or word) aligned accesses of a const unsigned int beginning at
addresses in1 and in2, respectively.

The use of aligned memory intrinsics is new to release 4.1 of the C6000
Optimizing C Compiler. Prior to this release, the method used was type-casting,
wherein the programmer casts a pointer of a “narrow” type to a pointer of a
“wider” type, as seen in the example below.
Example 2-12. Example of Vector Sum with Type-Casting


void vecsum4(short *restrict sum, const short *restrict in1,
             const short *restrict in2, unsigned int N)
{
    int i;
    const int *restrict i_in1 = (const int *)in1;
    const int *restrict i_in2 = (const int *)in2;
    int *restrict i_sum = (int *)sum;

    #pragma MUST_ITERATE (10);
    for (i = 0; i < (N/2); i++)
        i_sum[i] = _add2(i_in1[i], i_in2[i]);
}

In this example, pointers sum, in1, and in2 are cast to int*, which means that
they must point to word-aligned data. By default, the compiler aligns all global
short arrays on doubleword boundaries. The type-casting method, though
effective, is not supported by ANSI C. In the traditional C/C++ pointer model, the
pointer type specifies both the type of data pointed to and the width of
access to that data. With packed data processing, it is desirable to access
multiple elements of a given type with a single de-reference, as the example above
does. Normally, de-referencing a pointer-to-type returns a single element of
that type. Furthermore, the ANSI C standard states that pointers to different
types are presumed not to alias (except in the special case when one pointer
is a pointer-to-char). (See Chapter 2, Lesson One of the Compiler Tutorial for
more information on pointer/memory aliasing.) Thus, casting between types
can thwart dependence analysis and result in incorrect code.


In most cases, the C6000 compiler can correctly analyze the memory depen- 
dences. The compiler must disambiguate memory references in order to de- 
termine whether the memory references alias. In the case where the pointers 
are to different types (unless one of the references is to a char, as noted 
above), the compiler assumes they do not alias. Casts can break these default 
assumptions since the compiler only gets to see the type of the pointer when 
the de-reference happens, not the type of the data actually being pointed to. 
See the following example. 
Example 2-13. Example of Casting Breaking Default Assumptions


int test(short *x)
{
    int t, *y = (int *)x;

    *x = 0;
    t = *y;
    return t;
}

In this example, *x and *y are indirect references to unnamed objects via
pointers x and y. Those objects may or may not be distinct. According to the C
standard (section 2.4), a conforming program may not access an object of one type
via a pointer to another type when those types have different sizes. That
permits an optimizing compiler to assume that *x and *y refer to distinct objects
if it cannot prove otherwise. This assumption is often critical to obtaining
high-quality compiled code.


In this example, the compiler is allowed to assume that *x and *y refer to
objects that are independent, or distinct. Thus, the compiler could reorder the
store to *x and the load of *y, causing test() to return the old value of *y instead
of the new value, which is probably not what the user intended.


Another similar example is shown below. 


Example 2-14. Example Two of Casting Breaking Default Assumptions 


int test(short *x)
{
    int t;

    *x = 0;
    t = *((int *)x);
    return t;
}


In this case, the compiler is allowed to assume that *x and *((int *)x) are
independent. Therefore, the reordering of the store and load may occur as in
Example 2-13.


As these two examples illustrate, it is not recommended to assign a pointer of
one type to a pointer of another type. Instead, you should use the new memory
intrinsics at the point of reference to allow any size load or store to reference
a particular size pointer. The new memory intrinsics retain the type information
for the original type while still allowing the compiler to access data at a wider
width, so that the compiler default assumptions are no longer broken. These
new intrinsics build upon the two intrinsics added to the 4.0 release to support
non-aligned word and double word memory accesses (see Example 2-16).
Below, Example 2-13 is rewritten to use the memory intrinsics.


Example 2-15. Example 2-13 Rewritten Using Memory Access Intrinsics 


int test(short *x)
{
    int t;

    *x = 0;
    t = _amem4(x);
    return t;
}

In this example, _amem4 allows t to be loaded with an aligned 4—byte (word) 
value referenced by the short *x. 
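
For readers working on a host compiler that lacks the TI intrinsics, the same ordering guarantee can be approximated in standard C with memcpy, which is allowed to read the bytes of an object regardless of its declared type. This is a sketch under that assumption only: load_word_model and test_model are hypothetical names, and nothing here claims to be how the TI compiler implements _amem4.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable stand-in for t = _amem4(x): copy four bytes
   starting at x into a 32-bit value without casting the pointer. */
static uint32_t load_word_model(const short *x)
{
    uint32_t t;
    assert(sizeof(short) == 2);     /* the sketch assumes 16-bit shorts */
    memcpy(&t, x, sizeof t);
    return t;
}

/* Same shape as the test() examples: the store to *x must be visible
   to the wide load that follows it. */
static uint32_t test_model(short *x)
{
    *x = 0;
    return load_word_model(x);
}
```

Since the wide read goes through memcpy rather than through an int pointer, the compiler has no type-based reason to move it ahead of the store to *x.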


Table 2-7 summarizes all the memory access intrinsics. 


Table 2-7. Memory Access Intrinsics


(a) Double load/store

C Compiler Intrinsic    Description

_memd8(p)               unaligned access of double beginning at address p
                        (existing intrinsic)

_memd8_const(p)         unaligned access to const double beginning at
                        address p

_amemd8(p)              aligned access of double beginning at address p

_amemd8_const(p)        aligned access to const double beginning at address p


(b) Unsigned int load/store

C Compiler Intrinsic    Description

_mem4(p)                unaligned access of unsigned int beginning at
                        address p (existing intrinsic)

_mem4_const(p)          unaligned access to const unsigned int beginning at
                        address p

_amem4(p)               aligned access of unsigned int beginning at address p

_amem4_const(p)         aligned access to const unsigned int beginning at
                        address p


(c) Unsigned short load/store

C Compiler Intrinsic    Description

_mem2(p)                unaligned access of unsigned short beginning at
                        address p

_mem2_const(p)          unaligned access to const unsigned short beginning
                        at address p

_amem2(p)               aligned access of unsigned short beginning at
                        address p

_amem2_const(p)         aligned access to const unsigned short beginning at
                        address p

Pointer p can have any type. However, in order to allow the compiler to
correctly identify pointer aliases, it is crucial that the pointer argument p to each of
these intrinsic functions correctly identify the type of the object being pointed
to. That is, if you want to fetch four shorts at a time, the argument to _memd8()
must be a pointer to (or an array of) shorts.


On the ’C64x, nonaligned accesses to memory are allowed in C through the
_mem4 and _memd8 intrinsics.
Example 2-16. Vector Sum With Non-aligned Word Accesses to Memory


void vecsum4a(short *restrict sum, const short *restrict in1,
              const short *restrict in2, unsigned int N)
{
    int i;

    #pragma MUST_ITE


Another consideration is that the loop must now run for an even number of it- 
erations. You can ensure that this happens by padding the short arrays so that 
the loop always operates on an even number of elements. 


If a vecsum( ) function is needed to handle short-aligned data and odd-num- 
bered loop counters, then you must add code within the function to check for 
these cases. Knowing what type of data is passed to a function can improve 
performance considerably. It may be useful to write different functions that can 
handle different types of data. If your short-data operations always operate on 
even-numbered word-aligned arrays, then the performance of your application
can be improved. However, Example 2-17 provides a generic vecsum( )
function that handles all types of alignments and array sizes.
Example 2-17. Vector Sum With restrict Keywords, MUST_ITERATE pragma, and Word
Reads (Generic Version)


void vecsum5(short *restrict sum, const short *restrict in1,
             const short *restrict in2, unsigned int N)
{
    int i;

    /* test to see if sum, in2, and in1 are aligned to a word boundary */
    if (((int)sum | (int)in2 | (int)in1) & 0x2)
    {
        #pragma MUST_ITERATE (20);
        for (i = 0; i < N; i++)
            sum[i] = in1[i] + in2[i];
    }
    else
    {
        #pragma MUST_ITERATE (10);
        /* each word access covers two shorts: elements 2*i and 2*i+1 */
        for (i = 0; i < (N/2); i++)
            _amem4(&sum[2*i]) = _add2(_amem4_const(&in1[2*i]), _amem4_const(&in2[2*i]));

        /* odd N: add the last element separately */
        if (N & 0x1) sum[N-1] = in1[N-1] + in2[N-1];
    }
}
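
The alignment test and the handling of an odd element count described for the generic version can be exercised on a host machine with the sketch below. It is an illustration only: is_word_aligned and vecsum_generic_model are hypothetical names, memcpy stands in for the word-wide intrinsics, and the two-halves addition stands in for _add2.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A pointer is word aligned when its two low address bits are zero. */
static int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 0x3u) == 0;
}

/* Take the packed path only when all three arrays are word aligned;
   a scalar loop covers the unaligned case and any odd last element. */
static void vecsum_generic_model(int16_t *sum, const int16_t *in1,
                                 const int16_t *in2, unsigned n)
{
    unsigned i = 0;
    if (is_word_aligned(sum) && is_word_aligned(in1) && is_word_aligned(in2)) {
        for (; i + 1 < n; i += 2) {
            uint32_t w1, w2, ws;
            memcpy(&w1, &in1[i], sizeof w1);
            memcpy(&w2, &in2[i], sizeof w2);
            ws = (((w1 >> 16) + (w2 >> 16)) << 16) | ((w1 + w2) & 0xFFFFu);
            memcpy(&sum[i], &ws, sizeof ws);
        }
    }
    for (; i < n; i++)                       /* remainder or unaligned input */
        sum[i] = (int16_t)(in1[i] + in2[i]);
}
```

Both paths give the same results, so a caller does not need to know which one ran; only the speed differs on a target where the wide access is cheaper.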
2.5.2.1 Using Word Access in Dot Product


Other intrinsics that are useful for reading short data as words are the multiply
intrinsics. Example 2-18 is a dot product example that reads word-aligned
short data and uses the _mpy( ) and _mpyh( ) intrinsics. The _mpyh( ) intrinsic
uses the ’C6000 instruction MPYH, which multiplies the high 16 bits of two
registers, giving a 32-bit result.


This example also uses two sum variables (sum1 and sum2). Using only one
sum variable inhibits parallelism by creating a dependency between the write
from the first sum calculation and the read in the second sum calculation.
Within a small loop body, avoid writing to the same variable, because it inhibits
parallelism and creates dependencies.


Example 2-18. Dot Product Using Intrinsics 


int dotprod(const short *restrict a, const short *restrict b, unsigned int N)
{
    int i, sum1 = 0, sum2 = 0;

    /* each word access covers two shorts: elements 2*i and 2*i+1 */
    for (i = 0; i < (N >> 1); i++)
    {
        sum1 = sum1 + _mpy (_amem4_const(&a[2*i]), _amem4_const(&b[2*i]));
        sum2 = sum2 + _mpyh(_amem4_const(&a[2*i]), _amem4_const(&b[2*i]));
    }
    return sum1 + sum2;
}
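
The split into two accumulators can be checked on a host machine with a small model. The sketch below assumes only the behavior stated in Table 2-6 (_mpy multiplies the signed low halves, _mpyh the signed high halves); s16, mpy_model, mpyh_model, and dotprod_model are hypothetical names, and memcpy stands in for the aligned word load.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sign-extend the low 16 bits of v (hypothetical helper). */
static int32_t s16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Hypothetical models of _mpy and _mpyh on packed 16-bit data. */
static int32_t mpy_model(uint32_t a, uint32_t b)  { return s16(a) * s16(b); }
static int32_t mpyh_model(uint32_t a, uint32_t b) { return s16(a >> 16) * s16(b >> 16); }

/* Dot product with one word load per pair of shorts and one
   accumulator per half-word lane; n must be even. */
static int32_t dotprod_model(const int16_t *a, const int16_t *b, unsigned n)
{
    int32_t sum1 = 0, sum2 = 0;
    unsigned i;
    assert((n & 1u) == 0);
    for (i = 0; i < n; i += 2) {
        uint32_t wa, wb;
        memcpy(&wa, &a[i], sizeof wa);
        memcpy(&wb, &b[i], sizeof wb);
        sum1 += mpy_model(wa, wb);    /* low-half products  */
        sum2 += mpyh_model(wa, wb);   /* high-half products */
    }
    return sum1 + sum2;
}
```

Neither accumulator reads a value the other one wrote in the same iteration, which is the property the text asks for when it recommends separate sum variables.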
2.5.2.2 Using Word Access in FIR Filter


Example 2-19 shows an FIR filter that can be optimized with word reads of
short data and multiply intrinsics.


Example 2-19. FIR Filter - Original Form


void fir1(const short x[restrict], const short h[restrict], short y[restrict],
          int n, int m, int s)
{
    int i, j;
    long y0;
    long round = 1L << (s - 1);

    for (j = 0; j < m; j++)
    {
        y0 = round;
        for (i = 0; i < n; i++)
            y0 += x[i + j] * h[i];
        y[j] = (int) (y0 >> s);
    }
}

Example 2-20 shows an optimized version of Example 2-19. The optimized 
version passes an int array instead of casting the short arrays to int arrays and, 
therefore, helps ensure that data passed to the function is word-aligned. As- 
suming that a prototype is used, each invocation of the function ensures that 
the input arrays are word-aligned by forcing you to insert a cast or by using int 
arrays that contain short data. 
Example 2-20. FIR Filter - Optimized Form


void fir2(const int x[restrict], const int h[restrict], short y[restrict],
          int n, int m, int s)
{
    int i, j;
    long y0, y1;
    long round = 1L << (s - 1);

    #pragma MUST_ITERATE (8);
    for (j = 0; j < (m >> 1); j++)
    {
        y0 = y1 = round;
        #pragma MUST_ITERATE (8);
        for (i = 0; i < (m >> 1); i++)
        {

        }
short x[SIZE_
void main()
{
    fir1(_amem4_const(&x), _amem4_const(&h), y, n, m, s);
}
2.5.2.3 Using Double Word Access for Word Data (’C64x and ’C67x Specific)


The ’C64x and ’C67x families have a load double word (LDDW) instruction, 
which can read 64 bits of data into a register pair. Just like using word accesses 
to read 2 short data items, double word accesses can be used to read 2 word 
data items (or 4 short data items). When operating on a stream of float data, 
you can use double accesses to read 2 float values at a time, and then use 
intrinsics to operate on the data. 


The basic float dot product is shown in Example 2-21. Since the float addition 
(ADDSP) instruction takes 4 cycles to complete, the minimum kernel size for 
this loop is 4 cycles. For this version of the loop, a result is completed every 
4 cycles. 


Example 2-21. Basic Float Dot Product 


float dotp1(const float a[restrict], const float b[restrict])
{
    int i;
    float sum = 0;

    for (i = 0; i < 512; i++)
        sum += a[i] * b[i];

    return sum;
}

In Example 2-22, the dot product example is rewritten to use double word
loads, and intrinsics are used to extract the high and low 32-bit values
contained in the 64-bit double. The _hi() and _lo() intrinsics return integer values;
the _itof() intrinsic subverts the C typing system by interpreting an integer
value as a float value. In this version of the loop, 2 float results are computed every
4 cycles. Arrays can be aligned on double word boundaries by using either the
DATA_ALIGN (for globally defined arrays) or DATA_MEM_BANK (for locally
defined arrays) pragmas. Example 2-22 and Example 2-23 show these
pragmas.
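
What _hi(), _lo(), and _itof() do to a 64-bit value can be modeled in portable C. The sketch below is an approximation for experimentation on a host: hi_model, lo_model, itof_model, and dotprod2_model are hypothetical names, memcpy stands in for the double word load, and a 32-bit IEEE 754 float is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical models of _hi and _lo: the two 32-bit halves of a
   64-bit register pair. */
static uint32_t hi_model(uint64_t d) { return (uint32_t)(d >> 32); }
static uint32_t lo_model(uint64_t d) { return (uint32_t)(d & 0xFFFFFFFFu); }

/* Hypothetical model of _itof: reinterpret 32 bits as a float
   without converting the value. */
static float itof_model(uint32_t bits)
{
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Float dot product that fetches two floats per 64-bit access and
   keeps one accumulator per half; n must be even. */
static float dotprod2_model(const float *a, const float *b, unsigned n)
{
    float sum0 = 0.0f, sum1 = 0.0f;
    unsigned i;
    assert((n & 1u) == 0);
    for (i = 0; i < n; i += 2) {
        uint64_t da, db;
        memcpy(&da, &a[i], sizeof da);
        memcpy(&db, &b[i], sizeof db);
        sum0 += itof_model(hi_model(da)) * itof_model(hi_model(db));
        sum1 += itof_model(lo_model(da)) * itof_model(lo_model(db));
    }
    return sum0 + sum1;
}
```

Which array element ends up in the high half depends on the byte order of the host, but the two halves are treated symmetrically, so the final sum is the same either way.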


Note: For the pragmas that apply to functions or symbols, the syntax for
the pragma differs between C and C++. In C, you must supply the name of
the object or function to which you are applying the pragma as the first
argument. In C++, the name is omitted; the pragma applies to the declaration
of the object or function that follows it.

Example 2-22. Float Dot Product Using Intrinsics


float dotprod2(const double a[restrict], const double b[restrict])
{
    int i;
    float sum0 = 0;
    float sum1 = 0;

    for (i = 0; i < 512/2; i++)
    {
        sum0 += _itof(_hi(a[i])) * _itof(_hi(b[i]));
        sum1 += _itof(_lo(a[i])) * _itof(_lo(b[i]));
    }

    return sum0 + sum1;
}

float ret_val, a[SIZE_A], b[SIZE_B];

void main()
{
    ret_val = dotprod2(_amemd8_const(&a), _amemd8_const(&b));
}


In Example 2-23, the dot product example is unrolled to maximize perfor- 
mance. The preprocessor is used to define convenient macros FHI() and 
FLO() for accessing the high and low 32-bit values in a double word. In this 
version of the loop, 8 float values are computed every 4 cycles. 
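
The throughput figures quoted for the three dot product versions can be compared with simple arithmetic. The helper below is a rough sketch, not a cycle-accurate model: kernel_cycles is a hypothetical name, the 4-cycle kernel length is taken from the text, and prolog, epilog, and setup cycles are ignored.

```c
#include <assert.h>

/* Kernel cycles needed for `elements` values when each kernel
   iteration of `cycles_per_iter` cycles consumes `elems_per_iter`
   of them. */
static unsigned kernel_cycles(unsigned elements, unsigned elems_per_iter,
                              unsigned cycles_per_iter)
{
    assert(elems_per_iter != 0 && elements % elems_per_iter == 0);
    return (elements / elems_per_iter) * cycles_per_iter;
}
```

For 512 float values, this gives 2048 kernel cycles at one value per 4 cycles, 1024 at two values, and 256 at eight values.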
Example 2-23. Float Dot Product With Peak Performance


#define FHI(a) _itof(_hi(a))
#define FLO(a) _itof(_lo(a))

float dotp3(const double a[restrict], const double b[restrict])
{
    int i;
    float sum0 = 0;
    float sum1 = 0;
    float sum2 = 0;
    float sum3 = 0;
    float sum4 = 0;
    float sum5 = 0;
    float sum6 = 0;
    float sum7 = 0;

    for (i = 0; i < 512; i+=4)
    {
        sum0 += FHI(a[i])   * FHI(b[i]);
        sum1 += FLO(a[i])   * FLO(b[i]);
        sum2 += FHI(a[i+1]) * FHI(b[i+1]);
        sum3 += FLO(a[i+1]) * FLO(b[i+1]);
        sum4 += FHI(a[i+2]) * FHI(b[i+2]);
        sum5 += FLO(a[i+2]) * FLO(b[i+2]);
        sum6 += FHI(a[i+3]) * FHI(b[i+3]);
        sum7 += FLO(a[i+3]) * FLO(b[i+3]);
    }

    sum0 += sum1;
    sum2 += sum3;
    sum4 += sum5;
    sum6 += sum7;
    sum0 += sum2;
    sum4 += sum6;

    return sum0 + sum4;
}

void main()
{
    /* Using 0 as the bank parameter for the DATA_MEM_BANK */
    /* pragma aligns variable to a double word boundary for */
    /* the C62xx, C64xx, and C67xx. */
    #pragma DATA_MEM_BANK (a, 0);
    #pragma DATA_MEM_BANK (b, 0);
    float ret_val, a[SIZE_A], b[SIZE_B];

    ret_val = dotp3(_amemd8_const(&a), _amemd8_const((double *)b));
}
In Example 2-24, the dot product example has been rewritten for the ’C64x. This
demonstrates how it is possible to perform doubleword nonaligned memory
reads on a dot product that always executes a multiple of 4 times.


Example 2-24. Int Dot Product with Nonaligned Doubleword Reads


int dotp4(const short *restrict a, const short *restrict b, unsigned int N)
{
    int i, sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;

    for (i = 0; i < N; i+=4)
    {
        sum1 += _mpy (_lo(_memd8_const(&a[i])), _lo(_memd8_const(&b[i])));
        sum2 += _mpyh(_lo(_memd8_const(&a[i])), _lo(_memd8_const(&b[i])));
        sum3 += _mpy (_hi(_memd8_const(&a[i])), _hi(_memd8_const(&b[i])));
        sum4 += _mpyh(_hi(_memd8_const(&a[i])), _hi(_memd8_const(&b[i])));
    }
    return sum1 + sum2 + sum3 + sum4;
}
2.5.2.4 Using _nassert(), Word Accesses, and the MUST_ITERATE pragma


It is possible for the compiler to automatically perform packed data
optimizations for some, but not all, loops. By either using global arrays or using the
_nassert() intrinsic to provide alignment information about your pointers, the
compiler can transform your code to use word accesses and the ’C6000
intrinsics.


Example 2-25 shows how the compiler can automatically do this optimization. 


Example 2-25. Using the Compiler to Generate a Dot Product With Word Accesses 


int dotprod1(const short *restrict a, const short *restrict b, unsigned int N)
{
    int i, sum = 0;

    /* a and b are aligned to a word boundary */
    _nassert(((int)(a) & 0x3) == 0);
    _nassert(((int)(b) & 0x3) == 0);

    #pragma MUST_ITERATE (40, 40);
    for (i = 0; i < N; i++)
        sum += a[i] * b[i];

    return sum;
}

Compile Example 2-25 with the following options: -o -k. Open up the assembly
file and look at the loop kernel. The results are exactly the same as those
produced by Example 2-18. The first two _nassert() intrinsics in Example 2-25
tell the compiler that the arrays pointed to by a and b are aligned to a word
boundary, so it is safe for the compiler to use an LDW instruction to load two
short values. The compiler generates the _mpy() and _mpyh() intrinsics
internally, as well as the two sums that were used in Example 2-18 (shown again
below).

int dotprod(const short *restrict a, const short *restrict b, unsigned int N)
{
    int i, sum1 = 0, sum2 = 0;

    /* each word access covers two shorts: elements 2*i and 2*i+1 */
    for (i = 0; i < (N >> 1); i++)
    {
        sum1 = sum1 + _mpy (_amem4_const(&a[2*i]), _amem4_const(&b[2*i]));
        sum2 = sum2 + _mpyh(_amem4_const(&a[2*i]), _amem4_const(&b[2*i]));
    }
    return sum1 + sum2;
}

You need some way to convey to the compiler that this loop will also execute
an even number of times. The MUST_ITERATE pragma conveys loop count
information to the compiler. For example, #pragma MUST_ITERATE (40, 40)
tells the compiler the loop immediately following this pragma will execute a
minimum of 40 times (the first argument), and a maximum of 40 times (the
second argument). An optional third argument tells the compiler what the trip
count is a multiple of. See the TMS320C6000 C/C++ Compiler User’s Guide
for more information about the MUST_ITERATE pragma.
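
The sketch below shows one way the alignment assertions and a three-argument MUST_ITERATE pragma might be kept in source that also builds on a host compiler. Several things here are assumptions made for illustration: ALIGN_HINT and dotprod_hinted are hypothetical names, the trip-count bounds (2, 1024, 2) are arbitrary example values, and the __TI_COMPILER_VERSION__ macro is used only to hide the TI-specific lines from other compilers, where a run-time assert() checks the same condition instead.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __TI_COMPILER_VERSION__
#define ALIGN_HINT(expr) _nassert(expr)   /* TI build: hint only, no code */
#else
#define ALIGN_HINT(expr) assert(expr)     /* host build: check at run time */
#endif

/* Dot product written as a plain loop; the hints give the compiler
   the alignment and trip-count facts it needs for word accesses. */
static int dotprod_hinted(const short *a, const short *b, unsigned n)
{
    int sum = 0;
    unsigned i;

    ALIGN_HINT(((uintptr_t)a & 0x3u) == 0);
    ALIGN_HINT(((uintptr_t)b & 0x3u) == 0);

#ifdef __TI_COMPILER_VERSION__
    /* example bounds: at least 2, at most 1024, always a multiple of 2 */
#pragma MUST_ITERATE(2, 1024, 2)
#endif
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];

    return sum;
}
```

The caller is responsible for keeping every promise made by the hints; passing an unaligned pointer or a trip count outside the stated bounds would make the hinted code invalid.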


Example 2-26 and Example 2-27 show how to use the _nassert() intrinsic 
and MUST_ITERATE pragma to get word accesses on the vector sum and the 
FIR filter. 


Example 2-26. Using the _nassert() Intrinsic to Generate Word Accesses for Vector Sum 


void vecsum(short *restrict sum, const short *restrict in1,
            const short *restrict in2, unsigned int N)
{
    int i;

    _nassert(((int)sum & 0x3) == 0);
    _nassert(((int)in1 & 0x3) == 0);
    _nassert(((int)in2 & 0x3) == 0);

    #pragma MUST_ITERATE (40, 40);
    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}


Example 2-27. Using _nassert() Intrinsic to Generate Word Accesses for FIR Filter


void fir(const short x[restrict], const short h[restrict], short y[restrict],
         int n, int m, int s)
{
    int i, j;
    long y0;
    long round = 1L << (s - 1);

    _nassert(((int)x & 0x3) == 0);
    _nassert(((int)h & 0x3) == 0);
    _nassert(((int)y & 0x3) == 0);

    for (j = 0; j < m; j++)
    {
        y0 = round;
        #pragma MUST_ITERAT
        for (i = 0; i < n; i++)
            y0 += x[i + j] * h[i];
        y[j] = (int) (y0 >> s);
    }
}

As you can see from Example 2-27, the code generated by the compiler is
not as efficient as the code produced in Example 2-20, but it is more efficient
than the code in Example 2-19.


Example 2-28. Compiler Output From Example 2-27 


ED LOOP KE 


A9,A7:A6,A7: 
A3,B3,B2 
B3,A0,A0 

L3 
*++B9(8),B3 
*+A8 (4) ,A3 


B3,B5:B4,B95: 
AO,B1,A9 
*+B8 (4) ,B3 
*+A8 (6) ,A0 


BO,1,BO 
B2,B7:B6,B7: 
AO,A5:A4,A5: 
B1,B3,B3 


+B8 (8),Bl 
tA8 (8) ,A0 


@@@| 21) 


Al,1,Al1 
*4 
#4 @@@| 21] 




Example 2-29. Compiler Output From Example 2-20 


ED LOOP : 
ADD a B3,B5:B4,B5:B4 
ADD gail A3,A5:A4,A5:A4 
MV . Bl,B2 

B1,A8,B3 
B1,A8,A3 

3 

*B8,Bl1 

BO,1,B0O 
A3,A7:A6,A7:A6 
B3,B7:B6,B7:B6 
B2,A8,A3 
A8,B9,B3 
Al,1,Al1 
*A0++,A8 
*++B8,B9 


Kk 
ie) 
K 


< 
‘Ud 
K 
an 
= 


NOE EP PrPwuwew 


Cc 


E 


ERNEL 
.S1 A2,1,A2 

UL A5,A1:A0,A1:A0 
.M1X B5,A4,A5 

.S2 L4 
.L2 BO,1,B0 
.D11 *A3++,A4 
.D27 *B4++,B5 


Note: The _nassert() intrinsic may not solve all of your short-to-int or
float-to-double accesses, but it can be a useful tool in achieving better
performance without rewriting the C code.
If your code operates on global arrays as in Example 2-31, and you build your
application with the -pm and -o3 options, the compiler will have enough
information (trip counts and alignments of variables) to determine whether or not
packed-data processing optimization is feasible.


Example 2-31. Automatic Use of Word Accesses Without the _nassert Intrinsic 


<file1.c>
int dotp (short *restrict a, short *restrict b, int c)
{
    int sum = 0, i;
    for (i = 0; i < c; i++) sum += a[i] * b[i];
    return sum;
}
<file2.c>
#include <stdio.h>
short x[40] = {  1,  2,  3,  4,  5,  6,  7,  8,  9, 10,
                11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
                31, 32, 33, 34, 35, 36, 37, 38, 39, 40 };
short y[40] = { 40, 39, 38, 37, 36, 35, 34, 33, 32, 31,
                30, 29, 28, 27, 26, 25, 24, 23, 22, 21,
                20, 19, 18, 17, 16, 15, 14, 13, 12, 11,
                10,  9,  8,  7,  6,  5,  4,  3,  2,  1 };
void main()
{
    int z;
    z = dotp(x, y, 40);
    printf("z = %d\n", z);
}

Compile file1.c and file2.c with:
cl6x -pm -o3 -k file1.c file2.c
Below is the resulting assembly file (file1.asm). Notice that the dot product loop
uses word accesses and the ’C6000 intrinsics.


L2:     ; PIPED LOOP KERNEL

[!A1]   ADD     .L2     B6,B7,B7
[!A1]   ADD     .L1     A6,A0,A0
        MPY     .M2X    B5,A4,B6
        MPYH    .M1X    B5,A4,A6
[ B0]   B       .S1     L2
        LDW     .D1T1   *+A5(4),A4
        LDW     .D2T2   *+B4(4),B6

[ A1]   SUB     iil     A1,1,A1
[!A1]   ADD     .S2     B5,B8,B8
[!A1]   ADD     Bia     A6,A3,A3
        MPY     .M2X    B6,A4,B5
        MPYH    .M1X    B6,A4,A6
[ B0]   SUB     .L2     B0,1,B0
        LDW     .D1T1   *++A5(8),A4
        LDW     .D2T2   *++B4(8),B5
2.5.3 Software Pipelining


Software pipelining is a technique used to schedule instructions from a loop
so that multiple iterations of the loop execute in parallel. When you use the -o2
and -o3 compiler options, the compiler attempts to software pipeline your
code with information that it gathers from your program.


Figure 2-2 illustrates a software-pipelined loop. The stages of the loop are 
represented by A, B, C, D, and E. In this figure, a maximum of five iterations 
of the loop can execute at one time. The shaded area represents the loop ker- 
nel. In the loop kernel, all five stages execute in parallel. The area immediately 
before the kernel is known as the pipelined-loop prolog, and the area immedi- 
ately following the kernel is known as the pipelined-loop epilog. 


Figure 2-2. Software-Pipelined Loop


A1
B1  A2
C1  B2  A3                  Pipelined-loop prolog
D1  C2  B3  A4
E1  D2  C3  B4  A5          Kernel
    E2  D3  C4  B5
        E3  D4  C5          Pipelined-loop epilog
            E4  D5
                E5
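
The staircase in Figure 2-2 follows a simple rule: iteration k starts one cycle after iteration k-1 and then advances one stage per cycle. The following is a minimal sketch of that rule in C, assuming the five-stage loop of the figure; the names stage_at and print_schedule are hypothetical and are not part of any TI tool.

```c
#include <stdio.h>

#define NUM_STAGES 5   /* stages A..E, as in Figure 2-2 */

/* Returns the stage letter ('A'..'E') that iteration `iter` (1-based)
 * executes in cycle `cycle` (0-based), or 0 if that iteration is not
 * active in that cycle. */
char stage_at(int cycle, int iter)
{
    int stage = cycle - (iter - 1);   /* iteration k starts in cycle k-1 */
    if (stage < 0 || stage >= NUM_STAGES)
        return 0;
    return (char)('A' + stage);
}

/* Prints one row per cycle for a loop that runs `iters` iterations. */
void print_schedule(int iters)
{
    int cycle, iter;
    for (cycle = 0; cycle < iters + NUM_STAGES - 1; cycle++) {
        for (iter = 1; iter <= iters; iter++) {
            char s = stage_at(cycle, iter);
            if (s)
                printf("%c%d ", s, iter);
        }
        printf("\n");
    }
}
```

Calling print_schedule(5) prints the nine rows of the figure; the row for cycle 4 is the only one in which all five stages are active, which corresponds to the kernel.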


Because loops present critical performance areas in your code, consider the 
following areas to improve the performance of your C code: 


- Trip count
- Redundant loops
- Loop unrolling
- Speculative execution




2.5.3.1 Trip Count Issues 


A trip count is the number of loop iterations executed. The trip counter is
the variable used to count each iteration. When the trip counter reaches a
limit equal to the trip count, the loop terminates.


If the compiler can guarantee that at least n loop iterations will be
executed, then n is the known minimum trip count. Sometimes the compiler can
determine this information automatically. Alternatively, the user can provide
this information using the MUST_ITERATE and PROB_ITERATE pragmas. For more
information about pragmas, see the TMS320C6000 Optimizing C/C++ Compiler
User’s Guide (SPRU187).


The minimum safe trip count is the number of iterations of the loop that are nec- 
essary to safely execute the software pipelined version of the loop. 


All software pipelined loops have a minimum safe trip count requirement. If
the known minimum trip count is less than the minimum safe trip count,
redundant loops will be generated.


The known minimum trip count and the minimum safe trip count for a given 
software pipelined loop can be found in the compiler-generated comment 
block for that loop. 


In general, loops that can be most efficiently software pipelined have loop trip 
counters that count down. In most cases, the compiler can transform the loop 
to use a trip counter that counts down even if the original code was not written 
that way. 


For example, the optimizer at levels -o2 and -o3 transforms the loop in
Example 2-32(a) to something like the code in Example 2-32(b).


Example 2-32. Trip Counters


(a) Original code

for (i = 0; i < N; i++)     /* i = trip counter, N = trip count */


(b) Optimized code

for (i = N; i != 0; i--)    /* Downcounting trip counter */
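
To make the transformation concrete, here is a small sketch of the same loop body written with an up-counting and a down-counting trip counter. The function names are hypothetical, and the sketch assumes n is not negative; it only illustrates that the two forms compute the same result, not what the compiler actually emits.

```c
/* Up-counting form: i is the trip counter, n is the trip count. */
int sum_up(const short *a, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Down-counting form: the counter runs from n down to 0 while a separate
 * pointer walks the array in the original order. Assumes n >= 0. */
int sum_down(const short *a, int n)
{
    int i, sum = 0;
    for (i = n; i != 0; i--)
        sum += *a++;
    return sum;
}
```

Because the counter in the second form is compared only against zero, it no longer needs to serve as the array index.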
2.5.3.2 Eliminating Redundant Loops


Sometimes the compiler cannot determine if the loop always executes more
than the minimum safe trip count. Therefore, the compiler will generate two
versions of the loop:


- An unpipelined version that executes if the trip count is less than the
  minimum safe trip count.

- A software-pipelined version that executes if the trip count is equal to or
  greater than the minimum safe trip count.


Redundant loops increase code size and, to a lesser extent, reduce
performance.
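
The run-time choice between the two versions can be pictured with the following conceptual sketch. It is written in plain C for illustration only: the threshold value and all function names are hypothetical, and both stand-in functions contain the same code, whereas the compiler would emit differently scheduled assembly for each.

```c
/* Hypothetical threshold; the actual minimum safe trip count is reported
 * by the compiler in the comment block for each loop. */
#define MIN_SAFE_TRIP_COUNT 6

/* Stand-in for the unpipelined version of the loop. */
static int sum_unpipelined(const short *a, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Stand-in for the software-pipelined version; in real compiler output
 * this version is only safe when n >= MIN_SAFE_TRIP_COUNT. */
static int sum_pipelined(const short *a, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Selects a version at run time; *used_pipelined records the choice. */
int sum_dispatch(const short *a, int n, int *used_pipelined)
{
    *used_pipelined = (n >= MIN_SAFE_TRIP_COUNT);
    return *used_pipelined ? sum_pipelined(a, n) : sum_unpipelined(a, n);
}
```

If the compiler knows that the trip count is always at least the threshold, the test and the unpipelined copy both become unnecessary.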


To indicate to the compiler that you do not want two versions of the loop, you
can use the -ms0 or -ms1 option. The compiler will generate the
software-pipelined version of the loop only if it can prove that the minimum
trip count will always be equal to or greater than the effective minimum trip
count of the software-pipelined version of the loop. Otherwise, the
non-pipelined version will be generated. To help the compiler generate only
the software-pipelined version of the loop, use the MUST_ITERATE pragma and/or
the -pm option to help the compiler determine the known minimum trip count.


Note: Use of -ms0 or -ms1 May Result in a Performance Degradation

Using -ms0 or -ms1 may cause the compiler not to software pipeline a loop.
This can cause the performance of the loop to suffer.


When safe, the -mh option may also be used to reduce the need for a redundant
loop. The compiler performs an optimization called prolog/epilog collapsing
to reduce the code size of pipelined loops. In particular, this optimization
involves rolling the prolog and/or epilog (or parts thereof) back into the
kernel. This can result in a major code size reduction. This optimization can
also reduce the minimum trip count needed to safely execute the
software-pipelined loop, thereby eliminating the need for redundant loops in
many cases.


You can increase the compiler’s ability to perform this optimization by using
the -mh or -mhn option whenever possible. See the TMS320C6000 Optimizing
C/C++ Compiler User’s Guide for more information about options.
2.5.3.3 Communicating Trip-Count Information to the Compiler


When invoking the compiler, use the following options to communicate
trip-count information to the compiler:


- Use the -o3 and -pm compiler options to allow the optimizer to access the
  whole program or large parts of it and to characterize the behavior of loop
  trip counts.

- Use the MUST_ITERATE pragma to help reduce code size by preventing the
  generation of a redundant loop or by allowing the compiler (with or without
  the -ms option) to software pipeline innermost loops.


You can use the MUST_ITERATE and PROB_ITERATE pragmas to convey many
different types of information about the trip count to the compiler.


- The MUST_ITERATE pragma can convey that the trip count will always equal
  some value.

      /* This loop will always execute exactly 30 times */
      #pragma MUST_ITERATE (30, 30);
      for (j = 0; j < x; j++)

- The MUST_ITERATE pragma can convey that the trip count will be greater than
  some minimum value or smaller than some maximum value. The latter is useful
  when interrupts need to occur inside of loops and you are using the -mi<n>
  option. Refer to section 8.4, Interruptible Loops.

      /* This loop will always execute at least 30 times */
      #pragma MUST_ITERATE (30);
      for (j = 0; j < x; j++)

- The MUST_ITERATE pragma can convey that the trip count is always divisible
  by a value.

      /* The loop will execute some multiple of 4 times */
      #pragma MUST_ITERATE (,, 4);
      for (j = 0; j < x; j++)


This information can all be combined into a single C statement:

      #pragma MUST_ITERATE (8, 48, 8);
      for (j = 0; j < x; j++)


The compiler knows that this loop will execute some multiple of 8 (between 8
and 48) times. This information is useful in providing more information about
unrolling a loop or the ability to perform word accesses on a loop.
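
As a sketch of how the combined pragma might sit in a complete function, consider the following. The function scale_q15 and its arithmetic are hypothetical and chosen only for illustration. The pragma is guarded by the _TMS320C6X macro, which is assumed here to be predefined by the C6000 compiler, so that the same source also builds with other compilers.

```c
/* Scales n samples in place by a Q15 gain. The caller guarantees that n is
 * a multiple of 8 between 8 and 48, matching the pragma below. */
void scale_q15(short *restrict data, short gain, int n)
{
    int j;
#ifdef _TMS320C6X   /* assumed predefined macro; the pragma is TI-specific */
    #pragma MUST_ITERATE(8, 48, 8)
#endif
    for (j = 0; j < n; j++)
        data[j] = (short)((data[j] * gain) >> 15);
}
```

The pragma is a promise made by the programmer. If the function is ever called with a trip count outside the stated range, the generated code may behave incorrectly.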
Several examples in this chapter and in section 8.4.4 show all of the different
ways that the MUST_ITERATE pragma and _nassert intrinsic can be used.


The _nassert intrinsic can convey information about the alignment of pointers
and arrays.

void vecsum(short *restrict a, const short *restrict b,
            const short *restrict c)
{
    _nassert(((int) a & 0x3) == 0);   /* a is word aligned */
    _nassert(((int) b & 0x3) == 0);   /* b is word aligned */
    _nassert(((int) c & 0x7) == 0);   /* c is double word aligned */

}
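
The expressions passed to _nassert above are ordinary bit tests on the pointer value. The following sketch shows the same test as a portable helper; the name is_aligned is hypothetical, and uintptr_t is used instead of the int cast because pointer width differs on other hosts.

```c
#include <stdint.h>

/* Returns nonzero when p is aligned to `bytes`, which must be a power of
 * two: 4 for word alignment, 8 for double-word alignment. */
int is_aligned(const void *p, uintptr_t bytes)
{
    return ((uintptr_t)p & (bytes - 1)) == 0;
}
```

A helper of this kind can be used in a debug build to check that callers really satisfy the alignment that the function asserts to the compiler.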


See the TMS320C6000 Optimizing C/C++ Compiler User’s Guide for a complete
discussion of the -ms, -o3, and -pm options, the _nassert intrinsic, and
the MUST_ITERATE and PROB_ITERATE pragmas.


2.5.3.4 Loop Unrolling 


Another technique that improves performance is unrolling the loop; that is, ex- 
panding small loops so that each iteration of the loop appears in your code. 
This optimization increases the number of instructions available to execute in 
parallel. You can use loop unrolling when the operations in a single iteration 
do not use all of the resources of the C6000 architecture. 


There are three ways loop unrolling can be performed:

1) The compiler may automatically unroll the loop.
2) You can suggest that the compiler unroll the loop using the UNROLL pragma.
3) You can unroll the C/C++ code yourself.


In Example 2-33, the loop produces a new sum[i] every two cycles. Three
memory operations are performed: a load for both in1[i] and in2[i] and a store
for sum[i]. Because only two memory operations can execute per cycle, two
cycles are necessary to perform three memory operations.


Example 2-33. Vector Sum With Three Memory Operations 


void vecsum2(short *restrict sum, const short *restrict in1,
             const short *restrict in2, unsigned int N)
{
    int i;

    for (i = 0; i < N; i++)
        sum[i] = in1[i] + in2[i];
}


The performance of a software pipeline is limited by the number of resources
that can execute in parallel. In its word-aligned form (Example 2-34), the
vector sum loop delivers two results every two cycles because the two loads
and the store are all operating on two 16-bit values at a time.


Example 2-34. Word-Aligned Vector Sum 


void vecsum4(short *restrict sum, const short *restrict in1,
             const short *restrict in2, unsigned int N)
{
    int i;
    #pragma MUST_ITERATE (10);
    for (i = 0; i < (N/2); i++)
    {
        _amem4(&sum[i]) = _add2(_amem4_const(&in1[i]), _amem4_const(&in2[i]));
    }
}
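
The _add2 intrinsic used in Example 2-34 adds the two 16-bit halves of its operands independently. The following is a portable sketch of that behavior for readers without the C6000 tools; add2_emul is a hypothetical name, and the code is an emulation written for illustration, not the TI implementation.

```c
#include <stdint.h>

/* Adds the upper and lower 16-bit halves of a and b independently.
 * A carry out of the lower half does not propagate into the upper half. */
uint32_t add2_emul(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;   /* excess carry is shifted out */
    return hi | lo;
}
```

This is why a single word-wide load, add, and store can process two short values at once.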


If you unroll the loop once, the loop then performs six memory operations per 
iteration, which means the unrolled vector sum loop can deliver four results ev- 
ery three cycles (that is, 1.33 results per cycle). Example 2-35 shows four re- 
sults for each iteration of the loop: sum[i] and sum[i+sz] each store an int value 
that represents two 16-bit values. 
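
The cycle counts quoted above follow from the limit of two memory operations per cycle. A minimal sketch of that arithmetic, using a hypothetical helper name, is:

```c
/* Minimum cycles per loop iteration imposed by memory accesses alone,
 * assuming at most `ports` memory operations can issue per cycle. */
int mem_bound_cycles(int mem_ops, int ports)
{
    return (mem_ops + ports - 1) / ports;   /* ceiling division */
}
```

With two memory operations per cycle, three operations need two cycles and six operations need three cycles. Since the unrolled loop produces four results in those three cycles, its throughput is 4/3, or about 1.33 results per cycle, compared with one result per cycle before unrolling.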


Example 2-35 is not simple loop unrolling where the loop body is simply repli- 
cated. The additional instructions use memory pointers that are offset to point 
midway into the input arrays and the assumptions that the additional arrays are 
a multiple of four shorts in size. 


Example 2-35. Vector Sum Using const Keywords, MUST_ITERATE pragma, Word 
Reads, and Loop Unrolling 


void vecsum6(int *restrict sum, const int *restrict in1,
             const int *restrict in2, unsigned int N)
{
    int i;
    int sz = N >> 2;

    #pragma MUST_ITERATE (10);
    for (i = 0; i < sz; i++)
    {
        sum[i]    = _add2(in1[i], in2[i]);
        sum[i+sz] = _add2(in1[i+sz], in2[i+sz]);
    }
}
Software pipelining is performed by the compiler only on inner loops;
therefore, you can increase performance by creating larger inner loops. One
method for creating large inner loops is to completely unroll inner loops that
execute for a small number of cycles.


In Example 2—36, the compiler pipelines the inner loop with a kernel size of one 
cycle; therefore, the inner loop completes a result every cycle. However, the 
overhead of filling and draining the software pipeline can be significant, and 
other outer-loop code is not software pipelined. 


Example 2-36. FIR_Type2— Original Form 


void fir2(const short input[restrict], const short coefs[restrict],
          short out[restrict])
{
    int i, j;
    int sum = 0;

    for (i = 0; i < 40; i++)
    {
        for (j = 0; j < 16; j++)
            sum += coefs[j] * input[i + 15 - j];

        out[i] = (sum >> 15);
    }
}


For loops with a simple loop structure, the compiler uses a heuristic to deter- 
mine if it should unroll the loop. Because unrolling can increase code size, in 
some cases the compiler does not unroll the loop. If you have identified this 
loop as being critical to your application, then unroll the inner loop in C code, 
as in Example 2-37. 


In general, unrolling may be a good idea if you have an uneven partition or if
your loop carried dependency bound is greater than the partition bound. (Refer
to section 5.7, Loop Carry Paths, and section 3.2 in the TMS320C6000
Optimizing C/C++ Compiler User’s Guide.) This information can be obtained by
using the -mw option and looking at the comment block before the loop.
Example 2-37. FIR_Type2 - Inner Loop Completely Unrolled


void fir2_u(const short input[restrict], const short coefs[restrict],
            short out[restrict])
{
    int i, j;
    int sum;

    for (i = 0; i < 40; i++)
    {
        sum  = coefs[0]  * input[i + 15];
        sum += coefs[1]  * input[i + 14];
        sum += coefs[2]  * input[i + 13];
        sum += coefs[3]  * input[i + 12];
        sum += coefs[4]  * input[i + 11];
        sum += coefs[5]  * input[i + 10];
        sum += coefs[6]  * input[i + 9];
        sum += coefs[7]  * input[i + 8];
        sum += coefs[8]  * input[i + 7];
        sum += coefs[9]  * input[i + 6];
        sum += coefs[10] * input[i + 5];
        sum += coefs[11] * input[i + 4];
        sum += coefs[12] * input[i + 3];
        sum += coefs[13] * input[i + 2];
        sum += coefs[14] * input[i + 1];
        sum += coefs[15] * input[i + 0];

        out[i] = (sum >> 15);
    }
}


Now the outer loop is software-pipelined, and the overhead of draining and fill- 
ing the software pipeline occurs only once per invocation of the function rather 
than for each iteration of the outer loop. 


The heuristic the compiler uses to determine if it should unroll the loops
needs to know either of the following pieces of information. Without knowing
either of these, the compiler will never unroll a loop.


- The exact trip count of the loop
- The trip count of the loop is some multiple of two


The first requirement can be communicated using the MUST_ITERATE pragma. The
second requirement can also be passed to the compiler through the
MUST_ITERATE pragma. Section 2.5.3.3, Communicating Trip-Count Information to
the Compiler, explains that the MUST_ITERATE pragma can be used to provide
information about loop unrolling. By using the third argument, you can specify
that the trip count is a multiple or power of two.


#pragma MUST_ITERATE (n,n, 2);
Example 2-38 shows how the compiler can perform simple loop unrolling by
replicating the loop body. The MUST_ITERATE pragma tells the compiler that
the loop will execute an even number of times, 20 or more. The compiler will
unroll the loop once to take advantage of the performance gain that results
from the unrolling.


Example 2-38. Vector Sum 


void vecsum(short *restrict a, const short *restrict b,
            const short *restrict c, int n)
{
    int i;
    #pragma MUST_ITERATE (20, , 2);
    for (i = 0; i < n; i++) a[i] = b[i] + c[i];
}


<compiler output for above code>
           ; PIPED LOOP KERNEL

           ADD     B7,A3,A3
           B       L2
           LDH     *++A4(4),A3
           LDH     *++B4(4),B7

           STH     A3,*++A0(4)
           ADD     B6,A5,B6
           LDH     *+B4(2),B6

           SUB     A1,1,A1
           STH     B6,*++B5(4)
           SUB     B0,1,B0
           LDH     *+A4(2),A5


Note: When the interrupt threshold option is used, unrolling can be used
to regain lost performance. Refer to section 8.4.4, Getting the Most
Performance Out of Interruptible Code.


If the compiler does not automatically unroll the loop, you can suggest that the 
compiler unroll the loop by using the UNROLL pragma. See the 
TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more informa- 
tion. 
2.5.3.5 Speculative Execution (-mh option)


The -mh option facilitates the compiler’s ability to remove prologs and epilogs.
Indirectly, it can reduce register pressure. With the possibility of reducing epi- 
log code or elimination of redundant loops, use of this option can lead to better 
code size and performance. This option may cause a loop to read past the end 
of an array. Thus, the user assumes responsibility for safety. For a complete 
discussion of the -mh option, including how to use it safely, see the 
TMS320C6000 Optimizing C/C++ Compiler User’s Guide. 


2.5.3.6 What Disqualifies a Loop from Being Software-Pipelined 


In a sequence of nested loops, the innermost loop is the only one that can be 
software-pipelined. The following restrictions apply to the software pipelining 
of loops: 


- If a register value is live too long, the code is not software-pipelined.
  See section 5.6.6.2, Live Too Long, on page 5-67 and section 5.10,
  Live-Too-Long Issues, on page 5-101 for examples of code that is live too
  long.

- If the loop has complex condition code within the body that requires more
  than the five condition registers on the ’C62x and ’C67x, or six condition
  registers on the ’C64x, the loop is not software pipelined. Try to
  eliminate or combine these conditions.

- Although a software pipelined loop can contain intrinsics, it cannot contain
  function calls, including code that will call the run-time support routines.
  The exceptions are function calls that will be inlined.

      for (i = 0; i < 100; i++)
          x[i] = x[i] % 5;

  This will call the run-time support _remi routine.

- In general, you should not have a conditional break (early exit) in the
  loop. You may need to rewrite your code to use if statements instead. In
  some, but not all cases, the compiler can do this automatically. Use the if
  statements only around code that updates memory (stores to pointers and
  arrays) and around variables whose values are calculated inside the loop
  and are used outside the loop.
In the loop in Example 2-39, there is an early exit. If dist0 or dist1 is less
than distance, then execution breaks out of the loop early. If the compiler
could not perform transformations to the loop to software pipeline the loop,
you would have to modify the code. Example 2-40 shows how the code would be
modified so the compiler could software pipeline this loop. In this case,
however, the compiler can actually perform some transformations and software
pipeline this loop better than it can the modified code in Example 2-40.


Example 2-39. Use of If Statements in Float Collision Detection (Original Code) 


int colldet(const float *restrict x, const float *restrict p, float point,
            float distance)
{
    int I, retval = 0;
    float sum0, sum1, dist0, dist1;

    for (I = 0; I < (28 * 3); I += 6)
    {
        sum0 = x[I+0]*p[0] + x[I+1]*p[1] + x[I+2]*p[2];
        sum1 = x[I+3]*p[0] + x[I+4]*p[1] + x[I+5]*p[2];
        dist0 = sum0 - point;
        dist1 = sum1 - point;
        dist0 = fabs(dist0);
        dist1 = fabs(dist1);
        if (dist0 < distance)
        {
            retval = (int)&x[I + 0];
            break;
        }
        if (dist1 < distance)
        {
            retval = (int)&x[I + 3];
            break;
        }
    }
    return retval;
}
Example 2-40. Use of If Statements in Float Collision Detection (Modified Code)


int colldet_new(const float *restrict x, const float *restrict p, float point,
                float distance)
{
    int I, retval = 0;
    float sum0, sum1, dist0, dist1;

    for (I = 0; I < (28 * 3); I += 6)
    {
        sum0 = x[I+0]*p[0] + x[I+1]*p[1] + x[I+2]*p[2];
        sum1 = x[I+3]*p[0] + x[I+4]*p[1] + x[I+5]*p[2];
        dist0 = sum0 - point;
        dist1 = sum1 - point;
        dist0 = fabs(dist0);
        dist1 = fabs(dist1);
        if ((dist0 < distance) && !retval) retval = (int)&x[I + 0];
        if ((dist1 < distance) && !retval) retval = (int)&x[I + 3];
    }
    return retval;
}


- The loop cannot have an incrementing loop counter. Run the optimizer with
  the -o2 or -o3 option to convert as many loops as possible into
  downcounting loops.

- If the trip counter is modified within the body of the loop, it typically
  cannot be converted into a downcounting loop. If possible, rewrite the loop
  to not modify the trip counter. For example, the following code will not
  software pipeline:

      for (i = 0; i < n; i++)

- A conditionally incremented loop control variable is not software
  pipelined. Again, if possible, rewrite the loop to not conditionally modify
  the trip counter. For example, the following code will not software
  pipeline:

      for (i = 0; i < x; i++)
      {
          if (b > a)
              i += 2;
      }

- If the code size is too large and requires more than the 32 registers in
  the ’C62x and ’C67x, or 64 registers on the ’C64x, it is not software
  pipelined. Either try to simplify the loop or break the loop up into
  multiple smaller loops.
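
The early-exit restriction described above can be illustrated with a smaller, self-contained pair of functions. This is a sketch using hypothetical names, not code from the collision detection example: the first function leaves the loop with a break, and the second guards the update with an if statement so that the loop always runs its full trip count.

```c
/* Early-exit form: returns the index of the first element below limit,
 * or -1 if there is none. */
int first_below_break(const float *v, int n, float limit)
{
    int i, found = -1;
    for (i = 0; i < n; i++) {
        if (v[i] < limit) {
            found = i;
            break;
        }
    }
    return found;
}

/* Same result without a break: the loop always runs n times and only the
 * first match is recorded. */
int first_below_nobreak(const float *v, int n, float limit)
{
    int i, found = -1;
    for (i = 0; i < n; i++) {
        if (v[i] < limit && found < 0)
            found = i;
    }
    return found;
}
```

The second form does more work when a match occurs early, which is the price paid for a fixed trip count.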


Chapter 3 


Compiler Optimization Tutorial 


This chapter walks you through the code development flow and introduces you 
to compiler optimization techniques that were introduced in Chapter 1. It uses 
step-by-step instructions and code examples to show you how to use the soft- 
ware development tools in each phase of development. 


Before you start this tutorial, you should install Code Composer Studio. 


The sample code that is used in this tutorial is included on the code generation 
tools and Code Composer Studio CD-ROM. When you install your code gen- 
eration tools, the example code is installed in c:\ti\tutorial\sim62xx\optimiz- 
ing_c. Use the code in that directory to go through the examples in this chapter. 


The examples in this chapter were run on the most recent version of the soft- 
ware development tools that were available as of the publication of this book. 
Because the tools are being continuously improved, you may get different re- 
sults if you are using a more recent version of the tools. 


3.1 Introduction: Simple C Tuning


The ’C6000 compiler delivers the industry’s best "out of the box" C
performance. In addition to performing many common DSP optimizations, the
’C6000 compiler also performs software pipelining on various MIPS intensive
loops. This feature is important for any pipelined VLIW machine to perform. In
order to take full advantage of the eight available independent functional
units, the dependency graph of every loop is analyzed and then scheduled by
software pipelining. The more information the compiler gathers about the
dependency graph, the better the resulting schedule. Because of this, the
’C6000 compiler provides many features that facilitate sending information to
the compiler to "tune" your C code.


These tutorial lessons focus on four key areas where tuning your C code can 
offer great performance improvements. In this tutorial, a single code example 
is used to demonstrate all four areas. The following example is the vector 
summation of two weighted vectors. 


Example 3-1. Vector Summation of Two Weighted Vectors 


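void lesson_c(short *xptr, short *yptr, short *zptr, short *w_sum, int N) {
    int i, w_vec1, w_vec2;
    short w1,w2;

    w1 = zptr[0];
    w2 = zptr[1];
    for (i = 0; i < N; i++) {
        w_vec1 = xptr[i] * w1;
        w_vec2 = yptr[i] * w2;
        w_sum[i] = (w_vec1 + w_vec2) >> 15;
    }
}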


3.1.1. Project Familiarization 


In order to load and run the provided example project, you must select the ap- 
propriate target from Code Composer Setup. The c_tutorial project was built 
and saved as a CCS project file (c_tutorial.pjt). This project assumes a C62x 
fast simulator little endian target. Therefore, you need to import the same tar- 
get from Code Composer Setup: 


Set Up Code Composer Studio for C62x Fast Simulator Little Endian 
1) Click on Setup CCStudio to setup the target. 


2) From the import configuration window, select C62xx Fast Sim Ltl Endian. 
3) Click on the "Add to system configuration" button.
4) Click on the close button and exit setup. 

5) Save the configuration on exit. 

Load the Tutorial Project 

6) Start Code Composer Studio. 


7) From the Project menu, select Open. 


Browse to: ti\tutorial\sim62xx\optimizing_c\ 

8) Select c_tutorial.pjt , and click Open. 

Build tutor.out 
From the Project menu, select Rebuild All. 

Load tutor.out 

1) From the File menu, choose Load Program. 
Browse to ti\tutorial\sim62xx\optimizing_c\debug\ 


2) Select tutor.out, and click Open to load the file. 


The disassembly window with a cursor at c_int00 is displayed and high- 
lighted in yellow. 


Profile the c_tutorial project 


1) From the menu bar, select Profiler->Enable Clocks.


The Profile Statistics window shows profile points that are already set up
for each of the four functions, tutor1-4.


2) From the menu bar, select Debug->Run.


This updates the Profile Statistics and Disassembly windows. You can
also click on the Run icon or press the F5 key to run the program.


3) Click on the location bar at the top of the Profile Statistics window. 


The second profile point in each file (the one with the largest line number) con- 
tains the data you need. This is because profile points (already set up for you 
at the beginning and end of each function) count from the previous profile 
point. Thus, the cycle count data of the function is contained in the second pro- 
file point. 
You can see cycle counts of 414, 98, 79, and 55 for functions in tutor1-4,
running on the C62xx simulator. Each of these functions contains the same C
code but has some minor differences related to the amount of information to
which the compiler has access.


The rest of this tutorial discusses these differences and teaches you how and
when you can tune the compiler to obtain performance results comparable to
fully optimized hand-coded assembly.


3.1.2 Getting Ready for Lesson 1 


Compile and rerun the project 


1) From Project menu, choose Rebuild All, or click on the Rebuild All icon. 


All of the files are built with the compiler options -gp -k -g -mh -o3 -fr
C:\ti\tutorial\sim62xx\optimizing_c.


2) From the File menu, choose Reload Program.


This reloads tutor.out and returns the cursor to c_int00.


3) From the Debug menu, choose Run, or click the Run icon. 


The count in the Profile Statistics window now equals 2 with the cycle 
counts being an average of the two runs. 


4) Right-click in the Profile Statistics window and select clear all. 


This clears the Profile Statistics window. 
5) From the Debug menu, select Reset DSP. 
6) From the Debug menu, select Restart. 


This restarts the program from the entry point. You are now ready to start 
lesson 1. 


3.2 Lesson 1: Loop Carry Path From Memory Pointers


Open lesson_c.c 


In the Project View window, right-click on lesson_c.c and select Open. 


Example 3-2. lesson_c.c 


void lesson_c(short *xptr, short *yptr, short *zptr, short *w_sum, int N) {
    int i, w_vec1, w_vec2;
    short w1,w2;

    w1 = zptr[0];
    w2 = zptr[1];
    for (i = 0; i < N; i++) {
        w_vec1 = xptr[i] * w1;
        w_vec2 = yptr[i] * w2;
        w_sum[i] = (w_vec1 + w_vec2) >> 15;
    }
}


Compile the project and analyze the feedback in lesson_c.asm 


When you rebuilt the project in Getting Ready for Lesson 1, each file was
compiled with -k -gp -mh -o3. Because option -k was used, a *.asm file for
each *.c file is included in the rebuilt project.

1) From the File menu, choose Open. From the Files of Type drop-down menu,
select *.asm.

2) Select lesson_c.asm and click Open. 

Each .asm file contains software pipelining information. You can see the 
results in Example 3-3, Feedback From lesson_c.asm: 
Example 3-3. Feedback From lesson_c.asm


;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Known Minimum Trip Count
;*      Known Max Trip Count Factor
;*      Loop Carried Dependency Bound(^) : 10
;*      Unpartitioned Resource Bound
;*      Partitioned Resource Bound(*)
;*      Resource Partition:
;*                                A-side
;*      .L units                     0
;*      .S units                     1
;*      .D units                     2*
;*      .M units
;*      .X cross paths
;*      .T address paths
;*      Long read paths
;*      Long write paths
;*      Logical  ops (.LS)                 (.L or .S unit)
;*      Addition ops (.LSD)                (.L or .S or .D unit)
;*      Bound(.L .S .LS)
;*      Bound(.L .S .D .LS .LSD)
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 10 Schedule found with 1 iterations in parallel
;*      done
;*
;*      Collapsed epilog stages
;*      Collapsed prolog stages
;*
;*      Minimum safe trip count
;*
;*   SINGLE SCHEDULED ITERATION
;*
;*      LDH     *A4++,A0
;*      LDH     *B4++,B6
;*      NOP     2
;*      SUB     B0,1,B0
;*      B       C17
;*      MPY     A0,A5,A0
;*      MPY     B6,B5,B6
;*      NOP     1
;*      ADD     B6,A0,A0
;*      SHR     A0,15,A0
;*      STH     A0,*A3++
A schedule with ii = 10 implies that each iteration of the loop takes ten
cycles. With eight resources available every cycle on such a small loop, we
would expect this loop to do better than this.


Q Where are the problems with this loop? 
A A closer look at the feedback in lesson_c.asm gives us the answer. 


Q Why did the loop start searching for a software pipeline at ii=10 (for a 
10-cycle loop)? 


A The first iteration interval attempted by the compiler is always the maximum
of the Loop Carried Dependency Bound and the Partitioned Resource Bound.
In this case, the compiler thinks there is a loop carry path equal to ten
cycles:


;* Loop Carried Dependency Bound(^) : 10
The ^ symbol is interspersed in the assembly output in the comments of each
instruction in the loop carry path, and is visible in lesson_c.asm.
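
The starting point of the schedule search described above can be written as a one-line rule. The sketch below uses a hypothetical function name and models only the choice of the first iteration interval, not the search itself.

```c
/* First iteration interval (ii) the scheduler attempts: the larger of the
 * loop carried dependency bound and the partitioned resource bound. */
int initial_ii(int loop_carried_bound, int partitioned_resource_bound)
{
    return loop_carried_bound > partitioned_resource_bound
               ? loop_carried_bound
               : partitioned_resource_bound;
}
```

For lesson_c the loop carried dependency bound of 10 dominates. As an illustration, assuming a partitioned resource bound of 2, initial_ii(10, 2) gives 10, while removing the loop carry path, as in initial_ii(0, 2), would let the search start at 2.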


Example 3-4. lesson_c.asm 


L2:        ; PIPED LOOP KERNEL

           LDH     .D1T1   *A4++,A0
           LDH     .D2T2   *B4++,B6
           NOP             2
           SUB             B0,1,B0
           B               L2
           MPY             A0,A5,A0
           MPY             B6,B5,B6
           NOP             1
           ADD             B6,A0,A0
           SHR             A0,15,A0
           STH             A0,*A3++


You can also use a dependency graph to analyze feedback, for example: 


Q Why is there a dependency between STH and LDH? They do not use any 
common registers so how can there be a dependency? 


A If we look at the original C code in lesson_c.c, we see that the LDHs corre- 
spond to loading values from xptr and yptr, and the STH corresponds to storing 
values into w_sum array. 


Q Is there any dependency between xptr, yptr, and w_sum? 
A If all of these pointers point to different locations in memory, there is no
dependency. However, if they do not, there could be a dependency.


Because all three pointers are passed into lesson_c, there is no way for the
compiler to be sure they don’t alias, or point to the same location as each
other. This is a memory alias disambiguation problem. In this situation, the
compiler must be conservative to guarantee correct execution. Unfortunately,
the requirement for the compiler to be conservative can have dire effects on
the performance of your code.


We know from looking at the main calling function in tutor_d.c that in fact, these 
pointers all point to separate arrays in memory. However, from the compiler’s 
local view of lesson_c, this information is not available. 
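
To see why the compiler must be conservative, consider what happens if a caller does pass overlapping arrays. The sketch below uses a hypothetical function, weighted_sum, that performs the same arithmetic as the tutorial loop but takes the weights as arguments; it is an illustration, not the tutorial source.

```c
/* Same arithmetic as the tutorial loop, written without restrict. */
void weighted_sum(const short *xptr, const short *yptr, short w1, short w2,
                  short *w_sum, int n)
{
    int i;
    for (i = 0; i < n; i++)
        w_sum[i] = (short)((xptr[i] * w1 + yptr[i] * w2) >> 15);
}
```

If w_sum points one element past xptr, the store in iteration i writes the value that the load in iteration i+1 reads. With x = {16384, 0, 0, 0}, y all zero, w1 = 16384, and w_sum = x + 1, the loop turns x into {16384, 8192, 4096, 2048}; with a separate output array the results are {8192, 0, 0}. A schedule that issued the next load before the previous store completed would change the first result, so without further information the compiler has to keep each load behind the preceding store.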


Q How can you pass more information to the compiler to improve its perfor- 
mance? 


A The next example, lesson1_c provides the answer: 


Open lesson1_c.c and lesson1_c.asm 


Example 3-5. lesson1_c.c 


void lesson1_c(short * restrict xptr, short * restrict yptr, short *zptr,
               short *w_sum, int N)
{
    int i, w_vec1, w_vec2;
    short w1,w2;

    w1 = zptr[0];
    w2 = zptr[1];
    for (i = 0; i < N; i++) {
        w_vec1 = xptr[i] * w1;
        w_vec2 = yptr[i] * w2;
        w_sum[i] = (w_vec1 + w_vec2) >> 15;
    }
}


The only change made in lesson1_c is the addition of the restrict type
qualifier for xptr and yptr. Since we know that these are actually separate
arrays in memory from w_sum, in function lesson1_c we can declare that nothing
else points to these objects. No other pointer in lesson1_c.c points to xptr
and no other pointer in lesson1_c.c points to yptr. See the TMS320C6000
Optimizing C/C++ Compiler User’s Guide for more information on the restrict
type qualifier. Because of this declaration, the compiler knows that there is
no possible dependency between xptr, yptr, and w_sum. Compiling this file
creates feedback as shown in Example 3-6, lesson1_c.asm:
Example 3-6. lesson1_c.asm


;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Known Minimum Trip Count
;*      Known Max Trip Count Factor
;*      Loop Carried Dependency Bound(^) : 0
;*      Unpartitioned Resource Bound
;*      Partitioned Resource Bound(*)
;*      Resource Partition:
;*                                A-side
;*      .L units                     0
;*      .S units                     1
;*      .D units                     2*
;*      .M units                     1
;*      .X cross paths               1
;*      .T address paths             2
;*      Long read paths              1
;*      Long write paths             0
;*      Logical  ops (.LS)           1     (.L or .S unit)
;*      Addition ops (.LSD)          0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             1
;*      Bound(.L .S .D .LS .LSD)     2*
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 2 Schedule found with 5 iterations in parallel
;*      done
;*
;*      Collapsed epilog stages
;*      Prolog not entirely removed
;*      Collapsed prolog stages
;*
;*      Minimum required memory pad
;*
;*      Minimum safe trip count
;*
;*   SINGLE SCHEDULED ITERATION
;*
;*      LDH     *A4++,A4
;*      LDH     *B4++,B6
;*      NOP     2
;*      SUB     B0,1,B0
;*      B       C17
;*      MPY     A4,A5,A3
;*      MPY     B6,B5,B7
;*      NOP     1
;*      ADD     B7,A3,A3
At this point, the Loop Carried Dependency Bound is zero. By simply passing
more information to the compiler, we allowed it to improve a 10-cycle loop to
a 2-cycle loop.


Lesson 4 in this tutorial shows how the compiler retrieves this type of informa- 
tion automatically by gaining full view of the entire program with program level 
optimization switches. 


A special option in the compiler, -mt, tells the compiler to ignore alias disambi- 
guation problems like the one described in lesson_c. Try using this option to 
rebuild the original lesson_c example and look at the results. 


Rebuild lesson_c.c using the -mt option

1) From the Project menu, choose Options.

   The Build Options dialog window appears.

2) Select the Compiler tab.

3) In the Category box, select Advanced.

4) In the Aliasing drop-down box, select No Bad Alias Code.

   The -mt option will appear in the options window.

5) Click OK to set the new options.

6) Select lesson_c.c by selecting it in the project environment, or by
   double-clicking on it in the Project View window.

7) From the Project menu, choose Build, or click on the Build icon.

   If prompted, reload lesson_c.asm.

8) From the File menu, choose Open and select lesson_c.asm.

   You can now view lesson_c.asm in the main window. In the main window, you
   see that the file header contains a description of the options that were
   used to compile the file under Global File Parameters. The following line
   implies that -mt was used:

   ;* Memory Aliases : Presume not aliases (optimistic)

9) Scroll down until you see the feedback embedded in the lesson_c.asm file.
   You now see the following:

   ;* Loop Carried Dependency Bound(^) : 0

   ;* ii = 2 Schedule found with 5 iterations in parallel

This indicates that a 2-cycle loop was found. Lesson 2 will address information
about potential improvements to this loop.
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Table 3-1. Status Update: Tutorial example lesson_c lesson1_c 


Tutorial Example                                                  Lesson_c  Lesson1_c

Potential pointer aliasing info (discussed in Lesson 1)              x         ✓
Loop count info - minimum trip count (discussed in Lesson 2)         x         x
Loop count info - max trip count factor (discussed in Lesson 2)      x         x
Alignment info - xptr & yptr aligned on a word boundary
  (discussed in Lesson 3)                                            x         x
Cycles per iteration (discussed in Lessons 1-3)                      10        2
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3.3 Lesson 2: Balancing Resources With Dual-Data Paths 


Lesson 1 showed you a simple way to make large performance gains in
lesson_c. The result is lesson1_c with a 2-cycle loop.


Q Is this the best the compiler can do? Is this the best that is possible on the
VelociTI architecture?


A Again, the answers lie in the amount of knowledge to which the compiler has
access. Let's analyze the feedback of lesson1_c to determine what improvements
could be made:


Open lesson1_c.asm 
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Example 3-7. lesson1_c.asm 


;* SOFTWARE PIPELINE INFORMATION
;*
;*    Known Minimum Trip Count
;*    Known Max Trip Count Factor
;*    Loop Carried Dependency Bound(^) : 0
;*    Unpartitioned Resource Bound
;*    Partitioned Resource Bound(*)    : 2
;*    Resource Partition:
;*                              A-side
;*    .L units                     0
;*    .S units                     1
;*    .D units                     2*
;*    .M units                     1
;*    .X cross paths               1
;*    .T address paths             2*
;*    Long read paths              1
;*    Long write paths             0
;*    Logical ops (.LS)            1     (.L or .S unit)
;*    Addition ops (.LSD)          0     (.L or .S or .D unit)
;*    Bound(.L .S .LS)             1
;*    Bound(.L .S .D .LS .LSD)     2*
;*
;*    Searching for software pipeline schedule at ...
;*       ii = 2  Schedule found with 5 iterations in parallel
;*    done
;*
;*    Collapsed epilog stages
;*    Prolog not entirely removed
;*    Collapsed prolog stages
;*
;*    Minimum required memory pad
;*
;*    Minimum safe trip count
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The first iteration interval (ii) attempted was two cycles because the Partitioned 
Resource Bound is two. We can see the reason for this if we look below at the 
.D units and the .T address paths. This loop requires two loads (from xptr and 
yptr) and one store (to w_sum) for each iteration of the loop. 


Each memory access requires a .D unit for address calculation, and a .T ad- 
dress path to send the address out to memory. Because the ‘C6000 has two 
.D units and two .T address paths available on any given cycle (A side and B 
side), the compiler must partition at least two of the operations on one side (the 
A side). That means that these operations are the bottleneck in resources 
(highlighted with an *) and are the limiting factor in the Partitioned Resource 
Bound. The feedback in lesson1_c.asm shows that there is an imbalance in 
resources between the A and B side due, in this case, to an odd number of op- 
erations being mapped to two sides of the machine. 
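
The effect of an odd operation count can be checked with a little arithmetic. The sketch below assumes, as described above, that each side can start one memory access per cycle; side_bound is a hypothetical helper, not something the compiler provides.

```c
/* Sketch only: side_bound is a hypothetical helper, not a compiler API.
 * When `ops` operations of one kind are split as evenly as possible
 * between the A side and the B side, the busier side carries
 * ceil(ops / 2) of them. If each side handles one such operation per
 * cycle, that is also the minimum number of cycles needed. */
int side_bound(int ops)
{
    return (ops + 1) / 2;
}
```

For the three memory accesses in this loop, side_bound(3) is 2, which matches the Partitioned Resource Bound in the feedback. An even count splits without a leftover operation.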


Q Is it possible to improve the balance of resources? 


A One way to balance an odd number of operations is to unroll the loop. Now, 
instead of three memory accesses, you will have six, which is an even number. 
You can only do this if you know that the loop counter is a multiple of two; other- 
wise, you will incorrectly execute too few or too many iterations. In tutor_d.c, 
LOOPCOUNT is defined to be 40, which is a multiple of two, so you are able 
to unroll the loop. 
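
To make the idea concrete, here is a sketch of the loop unrolled by two by hand. It is for illustration only; the tutorial lets the compiler do the unrolling, and the name weighted_sum_unrolled2 is hypothetical.

```c
/* Sketch only: hand-unrolled by two to show what unrolling means.
 * Assumes N is a multiple of two; for an odd N the last pass would
 * process one element beyond the intended range. */
void weighted_sum_unrolled2(const short *xptr, const short *yptr,
                            const short *zptr, short *w_sum, int N)
{
    int i;
    short w1 = zptr[0];
    short w2 = zptr[1];

    for (i = 0; i < N; i += 2)
    {
        /* two original iterations per pass: four loads and two stores */
        w_sum[i]     = (short)((xptr[i]     * w1 + yptr[i]     * w2) >> 15);
        w_sum[i + 1] = (short)((xptr[i + 1] * w1 + yptr[i + 1] * w2) >> 15);
    }
}
```

Each pass now performs six memory accesses, which can be divided three and three between the two sides.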


Q Why did the compiler not unroll the loop? 


A In the limited scope of lesson1_c, the loop counter is passed as a parameter
to the function. Therefore, it might be any value from this limited view of the
function. To improve this scope you must pass more information to the compiler.
One way to do this is by inserting a MUST_ITERATE pragma. A MUST_ITERATE
pragma is a way of passing iteration information to the compiler. There
is no code generated by a MUST_ITERATE pragma; it is simply read at compile
time to allow the compiler to take advantage of certain conditions that may
exist. In this case, we want to tell the compiler that the loop will execute a
multiple of 2 times; knowing this information, the compiler can unroll the loop
automatically.


Unrolling a loop can incur some minor overhead in loop setup. The compiler 
does not unroll loops with small loop counts because unrolling may not reduce 
the overall cycle count. If the compiler does not know what the minimum value 
of the loop counter is, it will not automatically unroll the loop. Again, this is infor- 
mation the compiler needs but does not have in the local scope of lesson1_c. 
You know that LOOPCOUNT is set to 40, so you can tell the compiler that N 
is greater than some minimum value. lesson2_c demonstrates how to pass 
these two pieces of information. 
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Open lesson2_c.c 


Example 3-8. lesson2_c.c 


void lesson2_c(short * restrict xptr, short * restrict yptr, short *zptr,
               short *w_sum, int N)
{
    int i, w_vec1, w_vec2;
    short w1,w2;

    w1 = zptr[0];
    w2 = zptr[1];

    #pragma MUST_ITERATE(20, , 2);
    for (i = 0; i < N; i++)
    {
        w_vec1 = xptr[i] * w1;
        w_vec2 = yptr[i] * w2;
        w_sum[i] = (w_vec1+w_vec2) >> 15;
    }
}


In lesson2_c.c, no code is altered; only additional information is passed via the
MUST_ITERATE pragma. We simply guarantee to the compiler that the trip
count (in this case the trip count is N) is a multiple of two and that the trip count
is greater than or equal to 20. The first argument for MUST_ITERATE is the
minimum number of times the loop will iterate. The second argument is the
maximum number of times the loop will iterate. The trip count must be evenly
divisible by the third argument. See the TMS320C6000 Optimizing C/C++
Compiler User's Guide for more information about the MUST_ITERATE pragma.


For this example, we chose a trip count large enough to tell the compiler that
it is more efficient to unroll. Always specify the largest minimum trip count that
is safe.
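
Because the pragma is a promise rather than a check, it can be useful to verify the promise in debug builds. The following is a minimal sketch under that assumption; must_iterate_holds is a hypothetical helper and is not part of the compiler or its run-time library.

```c
/* Sketch only: must_iterate_holds is a hypothetical debug helper.
 * It tests the two promises that MUST_ITERATE(20, , 2) makes about the
 * trip count: at least min_trips iterations, and a multiple of
 * `multiple`. The caller must pass a multiple greater than zero. */
int must_iterate_holds(int n, int min_trips, int multiple)
{
    return n >= min_trips && n % multiple == 0;
}
```

A caller could, for example, place assert(must_iterate_holds(N, 20, 2)) ahead of the call to lesson2_c while testing, and remove it in the final build.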


Open lesson2_c.asm and examine the feedback 
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Example 3-9. lesson2_c.asm 


;* SOFTWARE PIPELINE INFORMATION
;*
;*    Loop Unroll Multiple             : 2x
;*    Known Minimum Trip Count         : 10
;*    Known Maximum Trip Count         : 1073741823
;*    Known Max Trip Count Factor
;*    Loop Carried Dependency Bound(^) : 0
;*    Unpartitioned Resource Bound     : 3
;*    Partitioned Resource Bound(*)    : 3
;*    Resource Partition:
;*                              A-side   B-side
;*    .L units                     0        0
;*    .S units                     1
;*    .D units                     3*       3*
;*    .M units                     2
;*    .X cross paths
;*    .T address paths             3*       3*
;*    Long read paths
;*    Long write paths
;*    Logical ops (.LS)            1              (.L or .S unit)
;*    Addition ops (.LSD)          1              (.L or .S or .D unit)
;*    Bound(.L .S .LS)             1
;*    Bound(.L .S .D .LS .LSD)     2
;*
;*    Searching for software pipeline schedule at ...
;*       ii = 3  Schedule found with 5 iterations in parallel
;*    done
;*
;*    Epilog not entirely removed
;*    Collapsed epilog stages
;*
;*    Prolog not entirely removed
;*    Collapsed prolog stages
;*
;*    Minimum required memory pad : 8 bytes
;*
;*    Minimum safe trip count


Notice the following things in the feedback: 


A schedule with three cycles (ii=3): You can tell by looking at the .D units and
.T address paths that this 3-cycle loop comes after the loop has been unrolled
because the resources show a total of six memory accesses evenly balanced
between the A side and B side. Therefore, our new effective loop iteration
interval is 3/2 or 1.5 cycles.


A Known Minimum Trip Count of 10: This is because we specified the count 
of the original loop to be greater than or equal to twenty and a multiple of two 
and after unrolling, this is cut in half. Also, a new line, Known Maximum Trip 
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Count, is displayed in the feedback. This represents the maximum signed inte- 
ger value divided by two, or 3FFFFFFFh. 
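
Both trip count figures follow from dividing the original bounds by the unroll factor. A quick sketch of that arithmetic, assuming a 32-bit int; the helper name unrolled_trip_count is hypothetical:

```c
#include <limits.h>

/* Sketch only: hypothetical helper showing how an unroll factor scales a
 * trip count bound reported in the feedback. Assumes the original trip
 * count is a multiple of the unroll factor, as MUST_ITERATE promised. */
int unrolled_trip_count(int original_trip_count, int unroll_factor)
{
    return original_trip_count / unroll_factor;
}
```

With these assumptions, unrolled_trip_count(20, 2) gives the Known Minimum Trip Count of 10, and unrolled_trip_count(INT_MAX, 2) gives 1073741823, which is 3FFFFFFFh.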


Therefore, by passing information without modifying the loop code, compiler
performance improves from a 10-cycle loop to 2 cycles and now to 1.5 cycles.


Q Is this the lower limit? 


A Check out Lesson 3 to find out! 


Table 3-2. Status Update: Tutorial example lesson_c lesson1_c lesson2_c 


Tutorial Example                                              Lesson_c  Lesson1_c  Lesson2_c

Potential pointer aliasing info (discussed in Lesson 1)          x         ✓          ✓
Loop count info - minimum trip count (discussed in Lesson 2)     x         x          ✓
Loop count info - max trip count factor (discussed in
  Lesson 2)                                                      x         x          ✓
Alignment info - xptr & yptr aligned on a word boundary
  (discussed in Lesson 3)                                        x         x          x
Cycles per iteration (discussed in Lessons 1-3)                  10        2          1.5
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3.4 Lesson 3: Packed Data Optimization of Memory Bandwidth 


Lesson 2 produced a 3-cycle loop that performed two iterations of the original 
vector sum of two weighted vectors. This means that each iteration of our loop 
now performs six memory accesses, four multiplies, two adds, two shift opera- 
tions, a decrement for the loop counter, and a branch. You can see this phe- 
nomenon in the feedback of lesson2_c.asm. 


Open lesson2_c.asm 


Example 3-10. lesson2_c.asm 


;* SOFTWARE PIPELINE INFORMATION
;*
;*    Loop Unroll Multiple             : 2x
;*    Known Minimum Trip Count         : 10
;*    Known Maximum Trip Count         : 1073741823
;*    Known Max Trip Count Factor
;*    Loop Carried Dependency Bound(^) : 0
;*    Unpartitioned Resource Bound     : 3
;*    Partitioned Resource Bound(*)    : 3
;*    Resource Partition:
;*                              A-side   B-side
;*    .L units                     0        0
;*    .S units                     1
;*    .D units                     3*       3*
;*    .M units                     2
;*    .X cross paths
;*    .T address paths             3*       3*
;*    Long read paths
;*    Long write paths
;*    Logical ops (.LS)            1              (.L or .S unit)
;*    Addition ops (.LSD)          1              (.L or .S or .D unit)
;*    Bound(.L .S .LS)             1
;*    Bound(.L .S .D .LS .LSD)     2
;*
;*    Searching for software pipeline schedule at ...
;*       ii = 3  Schedule found with 5 iterations in parallel
;*    done
;*
;*    Epilog not entirely removed
;*    Collapsed epilog stages
;*
;*    Prolog not entirely removed
;*    Collapsed prolog stages
;*
;*    Minimum required memory pad
;*
;*    Minimum safe trip count
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The six memory accesses appear as .D and .T units. The four multiplies ap- 
pear as .M units. The two shifts and the branch show up as .S units. The decre- 
ment and the two adds appear as .LS and .LSD units. Due to partitioning, they 
don’t all show up as .LSD operations. Two of the adds must read one value 
from the opposite side. Because this operation cannot be performed on the .D 
unit, the two adds are listed as .LS operations. 


By analyzing this part of the feedback, we can see that resources are most lim- 
ited by the memory accesses; hence, the reason for an asterisk highlighting 
the .D units and .T address paths. 


Q Does this mean that we cannot make the loop operate any faster? 
A Further insight into the C6000 architecture is necessary here. 


The C62x fixed-point device loads and/or stores 32 bits every cycle. In addi- 
tion, the C67x floating-point and ’C64x fixed-point device loads two 64-bit val- 
ues each cycle. In our example, we load four 16-bit values and store two 16-bit 
values every three cycles. This means we only use 32 bits of memory access 
every cycle. Because this is a resource bottleneck in our loop, increasing the 
memory access bandwidth further improves the performance of our loop. 
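
The 32-bit figure comes from averaging the traffic of one pass through the unrolled loop over its three cycles. A small sketch of that calculation; bits_per_cycle is a hypothetical helper used only to show the arithmetic:

```c
/* Sketch only: bits_per_cycle is a hypothetical helper, not a tool
 * feature. It averages the memory traffic of one pass through the loop
 * kernel over the number of cycles that pass takes. */
int bits_per_cycle(int loads, int stores, int bits_per_access, int cycles)
{
    return (loads + stores) * bits_per_access / cycles;
}
```

For the unrolled loop, bits_per_cycle(4, 2, 16, 3) evaluates to 32: four 16-bit loads and two 16-bit stores spread over three cycles.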


In the unrolled loop generated from lesson2_c, we load two consecutive 16-bit
elements with LDHs from both the xptr and yptr arrays.


Q Why not use a single LDW to load one 32-bit element, with the resulting reg- 
ister load containing the first element in one-half of the 32-bit register and the 
second element in the other half? 


A This is called Packed Data optimization. Two 16-bit loads are effectively per- 
formed by one single 32-bit load instruction. 
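
The idea can be imitated in portable C to show what ends up in the register. This is only an illustration of the data layout, assuming a little-endian machine; on the device the compiler performs the packing itself, and unpack_pair is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

/* Sketch only: unpack_pair is a hypothetical helper that imitates one
 * 32-bit load followed by extraction of the two 16-bit halves. Which
 * element lands in which half depends on the byte order of the machine;
 * the comments below assume little-endian byte order. */
void unpack_pair(const int16_t *src, int16_t *first, int16_t *second)
{
    uint32_t word;

    memcpy(&word, src, sizeof word);              /* one 32-bit access    */
    *first  = (int16_t)(uint16_t)(word & 0xFFFFu); /* low half:  src[0]   */
    *second = (int16_t)(uint16_t)(word >> 16);     /* high half: src[1]   */
}
```

The single 32-bit access replaces two separate 16-bit loads, which is where the saving in .D unit and .T address path usage comes from.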


Q Why doesn’t the compiler do this automatically in lesson2_c? 


A Again, the answer lies in the amount of information the compiler has access 
to from the local scope of lesson2_c. 


In order to perform an LDW (32-bit load) on the 'C62x and 'C67x cores, the
address must be aligned to a word address; otherwise, incorrect data is loaded.
An address is word-aligned if the lower two bits of the address are zero.
Unfortunately, in our example, the pointers xptr and yptr are passed into
lesson2_c, and there is no local scope knowledge as to their values. Therefore,
the compiler is forced to be conservative and assume that these pointers might
not be aligned. Once again, we can pass more information to the compiler, this
time via the _nassert statement.
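
The alignment condition itself is easy to express in standard C. The sketch below tests it at run time, which can help confirm that a buffer really meets the promise before that promise is made to the compiler; is_word_aligned is a hypothetical helper, not a compiler intrinsic.

```c
#include <stdint.h>

/* Sketch only: is_word_aligned is a hypothetical helper. It tests the
 * condition described above: an address is word-aligned when its lower
 * two bits are zero. */
int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 0x3u) == 0;
}
```

Unlike _nassert, which generates no code, this check costs a few instructions, so it is better suited to test builds.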


Open lesson3_c.c 
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Example 3-11. lesson3_c.c 


#define WORD_ALIGNED(x) (_nassert(((int)(x) & 0x3) == 0))

void lesson3_c(short * restrict xptr, short * restrict yptr, short *zptr,
               short *w_sum, int N)
{
    int i, w_vec1, w_vec2;
    short w1,w2;

    WORD_ALIGNED(xptr);
    WORD_ALIGNED(yptr);

    w1 = zptr[0];
    w2 = zptr[1];

    #pragma MUST_ITERATE(20, , 2);
    for (i = 0; i < N; i++)
    {
        w_vec1 = xptr[i] * w1;
        w_vec2 = yptr[i] * w2;
        w_sum[i] = (w_vec1+w_vec2) >> 15;
    }
}


By asserting that xptr and yptr addresses "anded" with 0x3 are equal to zero,
the compiler knows that they are word aligned. This means the compiler can
perform LDW and packed data optimization on these memory accesses.


Open lesson3_c.asm 




Example 3-12. lesson3_c.asm 


;* SOFTWARE PIPELINE INFORMATION
;*
;*    Loop Unroll Multiple             : 2x
;*    Known Minimum Trip Count         : 10
;*    Known Maximum Trip Count         : 1073741823
;*    Known Max Trip Count Factor
;*    Loop Carried Dependency Bound(^) : 0
;*    Unpartitioned Resource Bound
;*    Partitioned Resource Bound(*)    : 2
;*    Resource Partition:
;*                              A-side   B-side
;*    .L units                     0
;*    .S units                     2
;*    .D units                     2*       2*
;*    .M units                     2*
;*    .X cross paths               1
;*    .T address paths             2*       2*
;*    Long read paths              1
;*    Long write paths             0
;*    Logical ops (.LS)                           (.L or .S unit)
;*    Addition ops (.LSD)          0              (.L or .S or .D unit)
;*    Bound(.L .S .LS)             2*       1
;*    Bound(.L .S .D .LS .LSD)     2*       2*
;*
;*    Searching for software pipeline schedule at ...
;*       ii = 2  Schedule found with 6 iterations in parallel
;*    done
;*
;*    Epilog not entirely removed
;*    Collapsed epilog stages
;*
;*    Prolog not removed
;*    Collapsed prolog stages
;*
;*    Minimum required memory pad
;*
;*    Minimum safe trip count


Success! The compiler has fully optimized this loop. You can now achieve two
iterations of the loop every two cycles, for a throughput of one cycle per
iteration.


The .D and .T resources now show four (two LDWs and two STHs for two
iterations of the loop).
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Table 3-3. Status Update: Tutorial example lesson_c lesson1_c lesson2_c lesson3_c 


Tutorial Example                                  Lesson_c  Lesson1_c  Lesson2_c  Lesson3_c

Potential pointer aliasing info (discussed in
  Lesson 1)                                          x         ✓          ✓          ✓
Loop count info - minimum trip count (discussed
  in Lesson 2)                                       x         x          ✓          ✓
Loop count info - max trip count factor
  (discussed in Lesson 2)                            x         x          ✓          ✓
Alignment info - xptr & yptr aligned on a word
  boundary (discussed in Lesson 3)                   x         x          x          ✓
Cycles per iteration (discussed in Lessons 1-3)      10        2          1.5        1
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3.5 Lesson 4: Program Level Optimization 


In Lesson 3, you learned how to pass information to the compiler. This in- 
creased the amount of information visible to the compiler from the local scope 
of each function. 


Q Is this necessary in all cases? 


A The answer is no, not in all cases. First, if this information already resides
locally inside the function, the compiler has visibility here and restrict and
MUST_ITERATE statements are not usually necessary. For example, if xptr
and yptr are declared as local arrays, the compiler does not assume a
dependency with w_sum. If the loop count is defined in the function, or if the
loop is simply described as running from one to forty, the MUST_ITERATE
pragma is not necessary.
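
As a sketch of this first case, the function below keeps both the arrays and the loop count local, so the facts supplied by restrict and MUST_ITERATE in the earlier lessons are already visible inside it. The function and the fill pattern are hypothetical and serve only to illustrate the point.

```c
/* Sketch only: a hypothetical function in which the arrays and the trip
 * count are local, so no restrict or MUST_ITERATE is needed to describe
 * them. */
#define LOCAL_COUNT 40

int local_weighted_sum(short w1, short w2)
{
    short x[LOCAL_COUNT], y[LOCAL_COUNT], w_sum[LOCAL_COUNT];
    int i, total = 0;

    for (i = 0; i < LOCAL_COUNT; i++)       /* fill the local arrays   */
    {
        x[i] = (short)i;
        y[i] = (short)(LOCAL_COUNT - i);
    }
    for (i = 0; i < LOCAL_COUNT; i++)       /* trip count is known: 40 */
    {
        w_sum[i] = (short)((x[i] * w1 + y[i] * w2) >> 15);
    }
    for (i = 0; i < LOCAL_COUNT; i++)       /* reduce to one result    */
    {
        total += w_sum[i];
    }
    return total;
}
```

Here x, y, and w_sum are distinct objects declared in the function, and the loop bound is a compile-time constant that is a multiple of two.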


Secondly, even if this type of information is not declared locally, the compiler
can still have access to it in an automated way by giving it a program level view.
This module discusses how to do that.


The 'C6000 compiler provides two valuable switches, which enable program
level optimization: -pm and -op2. When these two options are used together,
the compiler can automatically extract all of the information we passed in the
previous examples. To tell the compiler to use program level optimization, you
need to turn on -pm and -op2.

Enable program level optimization 


1) From the Project menu, choose Options, and click on the Basic category.

2) Select No External Refs in the Program Level Optimization drop-down
   box. This adds -pm and -op2 to the command line.


View profile statistics


1) Clear the Profile Statistics window by right-clicking on it and selecting
   Clear All.

2) From the Project menu, choose Rebuild All.

3) From the File menu, choose Reload Program.

4) From the Debug menu, choose Run.

   The new profile statistics should appear in the Profile Statistics window, as
   in Example 3-13.
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Example 3-13. Profile Statistics 


Location                  Count   Average   Total   Maximum   Minimum

lesson_c.c line 27          1     5020.0    5020     5020      5020
lesson_c.c line 36          1       60.0      60       60        60
lesson1_c.c line 31         1       60.0      60       60        60
lesson2_c.c line 39         1       60.0      60       60        60
lesson3_c.c line 44         1       60.0      60       60        60
lesson1_c.c line 27         1       12.0      12       12        12
lesson2_c.c line 29         1       12.0      12       12        12
lesson3_c.c line 30         1       12.0      12       12        12


This is quite a performance improvement. The compiler automatically extracts 
and acts upon all the information that we passed in Lessons 1 to 3. Even the 
original untouched tutor1 is 100% optimized by discounting memory depen- 
dencies, unrolling, and performing packed data optimization. 


Table 3-4. Status Update: Tutorial example lesson_c lesson1_c lesson2_c lesson3_c 


Tutorial Example                                  Lesson_c  Lesson1_c  Lesson2_c  Lesson3_c

Potential pointer aliasing info (discussed in
  Lesson 1)                                          x         ✓          ✓          ✓
Loop count info - minimum trip count (discussed
  in Lesson 2)                                       x         x          ✓          ✓
Loop count info - max trip count factor
  (discussed in Lesson 2)                            x         x          ✓          ✓
Alignment info - xptr & yptr aligned on a word
  boundary (discussed in Lesson 3)                   x         x          x          ✓
Cycles per iteration (discussed in Lessons 1-3)      10        2          1.5        1
Cycles per iteration with program level
  optimization (discussed in Lesson 4)               1         1          1          1
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This tutorial has shown you that a lot can be accomplished by both tuning your 
C code and using program level optimization. Many different types of tuning 
optimizations can be done in addition to what was presented here. 


We recommend you use Appendix A, Feedback Solutions, when tuning your
code to get "how to" answers on all of your optimizing C questions. You can
also use the Feedback Solutions Appendix as a tool during development. We
believe this offers a significant advantage to TI customers and we plan on
continuing to drive a more developer-friendly environment in our future releases.
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3.6 Lesson 5: Writing Linear Assembly 


When the compiler does not fully exploit the potential of the C6000 architec- 
ture, you may be able to get better performance by writing your loop in linear 
assembly. Linear assembly is the input for the assembly optimizer. 


Linear assembly is similar to regular 'C6000 assembly code in that you use
'C6000 instructions to write your code. With linear assembly, however, you do
not need to specify all of the information that you need to specify in regular
'C6000 assembly code. With linear assembly code, you have the option of
specifying the information or letting the assembly optimizer specify it for you.
Here is the information that you do not need to specify in linear assembly code:

- Parallel instructions
- Pipeline latency
- Register usage
- Which functional unit is being used

If you choose not to specify these things, the assembly optimizer determines
the information that you do not include, based on the information that it has
about your code. As with other code generation tools, you might need to modify
your linear assembly code until you are satisfied with its performance. When
you do this, you will probably want to add more detail to your linear assembly.
For example, you might want to specify which functional unit should be used.


Before you use the assembly optimizer, you need to know the following things
about how it works:

- A linear assembly file must be specified with a .sa extension.

- Linear assembly code should include the .cproc and .endproc directives.
  The .cproc and .endproc directives delimit a section of your code that you
  want the assembly optimizer to optimize. Use .cproc at the beginning of
  the section and .endproc at the end of the section. In this way, you can set
  off sections of your assembly code that you want to be optimized, like
  procedures or functions.

- Linear assembly code may include a .reg directive. The .reg directive
  allows you to use descriptive names for values that will be stored in
  registers. When you use .reg, the assembly optimizer chooses a register whose
  use agrees with the functional units chosen for the instructions that
  operate on the value.

- Linear assembly code may include a .trip directive. The .trip directive
  specifies the value of the trip count. The trip count indicates how many
  times a loop will iterate.
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Let’s look at a new example, iircas4, which will show the benefit of using linear 
assembly. The compiler does not optimally partition this loop. Thus, the
iircas4 function does not improve with the C modification techniques we saw in
the first portion of the chapter. In order to get the best partition, we must write
the function in partitioned linear assembly.


In order to follow this example in Code Composer Studio, you must open the
CCS project, l_tutorial.pjt, located in c:\ti\tutorial\sim62xx\linear_asm. Build the
program and look at the software pipeline information feedback in the generated
assembly files.


Example 3-14. Using the iircas4 Function in C 


void iircas4_l1(const int n, const short (* restrict c) [4], 
int *y) 
{ 
int ko, 
int yO 
int yl 


#pragma MUST_ITE 


for (i O; i 
{ 

[i] [0]>>16) 
[i] [0]>>16) 
kO>>16) + yl; 
kO>>16) + kl; 


(d 
(d 
( 
( 


Example 3-15 shows the assembly output from Example 3-14 
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Example 3-15. Software Pipelining Feedback From the lircas4 C Code 


;* SOFTWARE PIPELINE INFORMATION
;*
;*    Known Minimum Trip Count
;*    Known Max Trip Count Factor
;*    Loop Carried Dependency Bound(^)
;*    Unpartitioned Resource Bound     : 4
;*    Partitioned Resource Bound(*)    : 5
;*    Resource Partition:
;*                              A-side   B-side
;*    .L units                     0
;*    .S units                     1
;*    .D units                     2
;*    .M units                     4
;*    .X cross paths               5*       3
;*    .T address paths             2
;*    Long read paths              1
;*    Long write paths             0
;*    Logical ops (.LS)            2              (.L or .S unit)
;*    Addition ops (.LSD)          4              (.L or .S or .D unit)
;*    Bound(.L .S .LS)             2
;*    Bound(.L .S .D .LS .LSD)     3
;*
;*    Searching for software pipeline schedule at ...
;*       ii = 5  Schedule found with 4 iterations in parallel
;*    done
;*
;*    Epilog not entirely removed
;*    Collapsed epilog stages
;*
;*    Prolog not removed
;*    Collapsed prolog stages
;*
;*    Minimum required memory pad : 16 bytes
;*
;*    Minimum safe trip count


From the feedback in the generated .asm file, we can see that the compiler
generated a suboptimal partition. Partitioning is placing operations and
operands on the A side or B side. We can see that the Unpartitioned Resource
Bound is 4 while the Partitioned Resource Bound is 5. When the Partitioned
Resource Bound is higher, this usually means we can make a better partition
by writing the code in linear assembly.


Notice that there are 5 cross path reads on the A side and only 3 on the B side.
We would like 4 cross path reads on the A side and 4 cross path reads on the
B side. This would allow us to schedule at an iteration interval (ii) of 4 instead
of the current ii of 5. Example 3-16 shows how to rewrite the iircas4() function
using linear assembly.
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Example 3-16. Rewriting the lircas4 () Function in Linear Assembly 


        .def    _iircas4_sa
_iircas4_sa:    .cproc  AI,C,BD,AY
        .no_mdep

        .reg    BD0,BD1,AA,AB,AJ0,AF0,AE0,AG0,AH0,AY0,AK0,AM0,BD00
        .reg    BA2,BB2,BJ1,BF1,BE1,BG1,BH1,BY1,BK1,BM1

        LDW     .D2     *+AY[0],AY0
        LDW     .D2     *+AY[1],BY1

        .mptr   C, bank+0, 8
        .mptr   BD, bank+4, 8

LOOP:   .trip   10

LDW sD2T1 *C++, AA H = c[i][0], al = 
LDW <D2TL *C++, AB 7 ela) (2),-b2 
LDW -D1T2 *BD[0O], BDO ; = d[i] [0] 

LDW -DLT2 *BD[1], BD1 ; = d[i] [1] 


PYH ‘ BD1, AA, AEO ; = >> 
PYHL ; BDO, AA, AJO ; = 

PYH ‘ BD1, AB, AGO 

PYHL : BDO, AB, AFO 


DD . AJO, AEO, AHO 
DD : AHO, AYO, AKO 
DD : AFO, AGO, AMO 
AMO, AKO, AYO 


AA, BA2 

AB, BB2 

BDO, BDOO 

KO, *BD[1] ; dali] [1] 


DOO, BA2, BEL j; el = 
KO, BA2, BJl ; jl = 
DOO, BB2, BGl j; gl = 
BB2, BFL ; fl = 


BYl, BHL ; hil = 
BE1, BKl ; kl = 
BGl, BML ; ml 

BK1, BYl ; yl = 


*BD++[2] ; Ali] [0] 


AI,1,AI 7 AS 
LOOP ; for 


STW F AYO, *+AY[0] 
STW : BY1,*+AY[1] 


-endproc 
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The following example shows the software pipeline feedback from 
Example 3-16. 


Example 3-17. Software Pipeline Feedback from Linear Assembly 


;* SOFTWARE PIPELINE INFORMATION
;*
;*    Loop label : LOOP
;*    Known Minimum Trip Count
;*    Known Max Trip Count Factor
;*    Loop Carried Dependency Bound(^)
;*    Unpartitioned Resource Bound
;*    Partitioned Resource Bound(*)
;*    Resource Partition:
;*                              A-side   B-side
;*    .L units
;*    .S units
;*    .D units
;*    .M units
;*    .X cross paths
;*    .T address paths
;*    Long read paths
;*    Long write paths
;*    Logical ops (.LS)                           (.L or .S unit)
;*    Addition ops (.LSD)                         (.L or .S or .D unit)
;*    Bound(.L .S .LS)
;*    Bound(.L .S .D .LS .LSD)
;*
;*    Searching for software pipeline schedule at ...
;*       ii = 4  Schedule found with 5 iterations in parallel
;*    done
;*
;*    Epilog not entirely removed
;*    Collapsed epilog stages
;*
;*    Prolog not removed
;*    Collapsed prolog stages
;*
;*    Minimum required memory pad : 24 bytes
;*
;*    Minimum safe trip count


Notice in Example 3-16 that each instruction is manually partitioned. From the
software pipeline feedback information in Example 3-17, you can see that a
software pipeline schedule is found at ii = 4. This is a result of rewriting the
iircas4() function in linear assembly, as shown in Example 3-16.
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4.1 Understanding Feedback 


The compiler provides some feedback by default. Additional feedback is gen- 
erated with the -mw option. The feedback is located in the .asm file that the 
compiler generates. In order to view the feedback, you must also enable -k 
which retains the .asm output from the compiler. By understanding feedback, 
you can quickly tune your C code to obtain the highest possible performance. 


The feedback in Example 1-1 is for an innermost loop. On the 'C6000, C code
loop performance is greatly affected by how well the compiler can software
pipeline. The feedback is geared toward explaining exactly what issues arose
in pipelining the loop and what results were obtained. This section covers
all the components of the software pipelining feedback window.


The compiler goes through three basic stages when compiling a loop. Here we
will focus on understanding these stages and the feedback produced by them.
This, combined with the Feedback Solutions in Appendix A, will send you well
on your way to fully optimizing your code with the C6000 compiler.
The three stages are:


1) Qualify the loop for software pipelining 
2) Collect loop resource and dependency graph information 


3) Software pipeline the loop 


4.1.1 Stage 1: Qualify the Loop for Software Pipelining 


The result of this stage will show up as the first three or four lines in the feed- 
back window as long as the compiler qualifies the loop for pipelining: 


Example 4—1.Stage 1 Feedback 


Known Minimum Trip Count 
Known Maximum Trip Count 


Known Max Trip Count Factor 


- Trip Count. The number of iterations or trips through a loop.

- Minimum Trip Count. The minimum number of times the loop might execute
  given the amount of information available to the compiler.

- Maximum Trip Count. The maximum number of times the loop might execute
  given the amount of information available to the compiler.

- Maximum Trip Count Factor. The maximum number that will divide
  evenly into the trip count. Even though the exact value of the trip count is
  not deterministic, it may be known that the value is a multiple of 2, 4, etc.,
  which allows more aggressive packed data and unrolling optimization.


The compiler tries to identify what the loop counter (named trip counter be- 
cause of the number of trips through a loop) is and any information about the 
loop counter such as minimum value (known minimum trip count), and wheth- 
er it is a multiple of something (has a known maximum trip count factor). 


If factor information is known about a loop counter, the compiler can be more 
aggressive with performing packed data processing and loop unrolling opti- 
mizations. For example, if the exact value of a loop counter is not known but 
it is known that the value is a multiple of some number, the compiler may be 
able to unroll the loop to improve performance. 
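
The trip count information described above can be supplied from C with the MUST_ITERATE pragma named in sections 4.3.7 and 4.4.2. The following is a minimal sketch, not code from this manual: the function name sum_samples and the promised values (at least 8 trips, a multiple of 4) are assumptions chosen for illustration, and the _TMS320C6X guard is only there so that the sketch also compiles with a host compiler that does not know the pragma.

```c
#include <assert.h>

/* Sum of n 16-bit samples. The caller promises n >= 8 and n % 4 == 0
   (illustrative values, not taken from the manual). */
int sum_samples(const short *x, int n)
{
    int i;
    int sum = 0;

    /* Precondition that backs the pragma below; if it does not hold,
       the promise made to the compiler is false. */
    assert(n >= 8 && n % 4 == 0);

#ifdef _TMS320C6X
    /* minimum trip count 8, maximum left blank, trip count a multiple of 4 */
    #pragma MUST_ITERATE(8, , 4)
#endif
    for (i = 0; i < n; i++)
        sum += x[i];

    return sum;
}
```

With this information, the Stage 1 feedback should be able to report a known minimum trip count and a known trip count factor for the loop, rather than treating the count as unknown.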


There are several conditions that must be met before software pipelining is al-
lowed, or legal, from the compiler’s point of view. These conditions are:


- The loop cannot have too many instructions. Loops that are too big
  typically require more registers than are available and require a longer
  compilation time.


- The loop cannot call another function unless the called function is
  inlined. Any break in control flow makes it impossible to software
  pipeline, as multiple iterations are executing in parallel.


If any of the conditions for software pipelining are not met, qualification of the
pipeline will halt and a disqualification message will appear. For more
information about what disqualifies a loop from being software-pipelined, see
section 2.5.3.6, on page 2-62.


4.1.2 Stage 2: Collect Loop Resource and Dependency Graph Information


The second stage of software pipelining a loop is collecting loop resource and 
dependency graph information. The results of stage 2 will be displayed in the 
feedback window as follows: 


Example 4-2. Stage 2 Feedback


Loop Carried Dependency Bound(^) :
Unpartitioned Resource Bound     : 4
Partitioned Resource Bound(*)    : 5
Resource Partition:
                          A-side   B-side
.L units                     2        3
.S units                     4        4
.D units                     1
.M units
.X cross paths
.T address paths
Long read paths
Long write paths
Logical ops (.LS)            0        1     (.L or .S unit)
Addition ops (.LSD)          6              (.L or .S or .D unit)
Bound(.L .S .LS)                      4
Bound(.L .S .D .LS .LSD)     5*


- Loop carried dependency bound. The distance of the largest loop carry
  path, if one exists. A loop carry path occurs when one iteration of a loop
  writes a value that must be read in a future iteration. Instructions that are
  part of the loop carry bound are marked with the ^ symbol in the assembly
  code saved with the -k option in the *.asm file. The number shown for the
  loop carried dependency bound is the minimum iteration interval imposed
  by the loop carried dependencies in the loop.


Often, this loop carried dependency bound is due to lack of knowledge by 
the compiler about certain pointer variables. When exact values of point- 
ers are not known, the compiler must assume that any two pointers might 
point to the same location. Thus, loads from one pointer have an implied 
dependency to another pointer performing a store and vice versa. This can 
create large (and usually unnecessary) dependency paths. When the 
Loop Carried Dependency Bound is larger than the Resource Bound, this 
is often the culprit. Potential solutions for this are shown in Appendix A, 
Feedback Solutions. 


- Unpartitioned resource bound across all resources. The best case re-
  source bound minimum iteration interval (mii) before the compiler has
  partitioned each instruction to the A or B side. In Example 4-2, the
  unpartitioned resource bound is 4 because the .S units are required for
  8 cycles, and there are 2 .S units.


- Partitioned resource bound across all resources. The mii after the in-
  structions are partitioned to the A and B sides. In Example 4-2, after
  partitioning, we can see that the A side .L, .S, and .D units are required
  for a total of 13 cycles, making the partitioned resource bound
  ceil(13/3) = 5. For more information, see the description of
  Bound (.L .S .D .LS .LSD) later in this section.


- Resource partition table. Summarizes how the instructions have been
  assigned to the various machine resources and how they have been parti-
  tioned between the A and B side. An asterisk is used to mark those entries
  that determine the resource bound value, in other words the maximum
  mii. Because the resources on the C6000 architecture are fairly orthogo-
  nal, many instructions can execute on 2 or more different functional
  units. For this reason, the table breaks these functional units down by
  the possible resource combinations. The table entries are described below:


  - Individual Functional Units (.L .S .D .M) show the total number of
    instructions that specifically require the .L, .S, .D, or .M functional
    units. Instructions that can operate on multiple different functional
    units are not included in these counts. They are described below in the
    Logical Ops (.LS) and Addition Ops (.LSD) rows.


  - .X cross paths represents the total number of A-to-B and B-to-A cross
    paths. When this particular row contains an asterisk, it has a resource
    bottleneck and partitioning may be a problem.


  - .T address paths represents the total number of address paths re-
    quired by the loads and stores in the loop. This is actually different
    from the number of .D units needed, as some other instructions may use
    the .D unit. In addition, there can be cases where the number of .T ad-
    dress paths on a particular side might be higher than the number of .D
    units if .D units are partitioned evenly between A and B and .T address
    paths are not.


  - Long read path represents the total number of long read port paths.
    All long operations with long sources use this port to do extended
    width (40-bit) reads. Store operations share this port so they also
    count toward this total.


  - Long write path represents the total number of long write port paths.
    All instructions with long (40-bit) results will be counted in this
    number.


  - Logical ops (.LS) represents the total number of instructions that can
    use either the .L or .S unit.


  - Addition ops (.LSD) represents the total number of instructions that
    can use either the .L or .S or .D unit.


  - Bound (.L .S .LS) represents the resource bound value as deter-
    mined by the number of instructions that use the .L and .S units. It is
    calculated with the following formula:


    Bound(.L .S .LS) = ceil((.L + .S + .LS) / 2)


    Where ceil represents the ceiling function. This means you always
    round up to the nearest integer. In Example 4-2, if the B side needs:


    3 .L unit only instructions
    4 .S unit only instructions
    1 logical .LS instruction

    you would need at least ceil(8/2) = 4 cycles to issue these.


  - Bound (.L .S .D .LS .LSD) represents the resource bound value as
    determined by the number of instructions that use the .D, .L and .S
    unit. It is calculated with the following formula:

    Bound(.L .S .D .LS .LSD) = ceil((.L + .S + .D + .LS + .LSD) / 3)


    Where ceil represents the ceiling function. This means you always
    round up to the nearest integer. In Example 4-2, the A side needs:


    2 .L unit only instructions, 4 .S unit only instructions, 1 .D unit only
    instruction, 0 logical .LS instructions, and 6 addition .LSD instructions


    You would need at least ceil(13/3) = 5 cycles to issue these.
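
The two Bound formulas above can be checked with a few lines of C. This is a minimal sketch under stated assumptions: the struct side_counts type and the function names are hypothetical helpers introduced for illustration, not part of any TI tool, and the counts are the ones quoted in the text for Example 4-2.

```c
#include <assert.h>

/* Instruction counts for one side of the machine (A or B). */
struct side_counts {
    int l;    /* instructions that need the .L unit only */
    int s;    /* instructions that need the .S unit only */
    int d;    /* instructions that need the .D unit only */
    int ls;   /* logical ops: can use .L or .S           */
    int lsd;  /* addition ops: can use .L, .S, or .D     */
};

/* Integer ceiling of num/den for num >= 0 and den > 0. */
static int ceil_div(int num, int den)
{
    assert(num >= 0 && den > 0);
    return (num + den - 1) / den;
}

/* Bound(.L .S .LS) = ceil((.L + .S + .LS) / 2) */
int bound_l_s_ls(const struct side_counts *c)
{
    return ceil_div(c->l + c->s + c->ls, 2);
}

/* Bound(.L .S .D .LS .LSD) = ceil((.L + .S + .D + .LS + .LSD) / 3) */
int bound_l_s_d_ls_lsd(const struct side_counts *c)
{
    return ceil_div(c->l + c->s + c->d + c->ls + c->lsd, 3);
}
```

Feeding in the A-side counts from the text (2, 4, 1, 0, 6) gives 13 instructions over three units and a bound of 5; the B-side counts (3, 4, 1) give 8 instructions over two units and a bound of 4.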




4.1.3 Stage 3: Software Pipeline the Loop 


Once the compiler has completed qualification of the loop, partitioned it, and 
analyzed the necessary loop carry and resource requirements, it can begin to 
attempt software pipelining. This section will focus on the following lines from 
the feedback example: 


Example 4-3. Stage 3 Feedback


Searching for software pipeline schedule at ...
   ii = 5  Register is live too long
   ii = 6  Did not find schedule
   ii = 7  Schedule found with 3 iterations in parallel
done

Epilog not entirely removed
Collapsed epilog stages     : 1

Prolog not removed
Collapsed prolog stages     : 0

Minimum required memory pad : 2 bytes

Minimum safe trip count     : 2


- Iteration interval (ii). The number of cycles between the initiation of
successive iterations of the loop. The smaller the iteration interval, the 
fewer cycles it takes to execute a loop. All of the numbers shown in each 
row of the feedback imply something about what the minimum iteration in- 
terval (mii) will be for the compiler to attempt initial software pipelining. 


Several things will determine what the mii of the loop is and are described 
in the following sections. The mii is simply the maximum of any of these 
individual mii’s. 


The first thing the compiler attempts during this stage is to schedule the loop
at an iteration interval (ii) equal to the mii determined in stage 2: collect loop
resource and dependency graph information. In the example above, since the
A-side bound (.L, .S, .D, .LS, and .LSD) was the mii bottleneck, our example
starts with:

Searching for software pipeline schedule at ...
   ii = 5  Register is live too long


If the attempt was not successful, the compiler provides additional feedback
to help explain why. In this case, the compiler cannot find a schedule at 5
cycles because a register is live too long. For more information about live too
long issues, see section 5.10, on page 5-101.


Sometimes the compiler finds a valid software pipeline schedule but one or 
more of the values is live too long. Lifetime of a register is determined by the 
cycle a value is written into it and by the last cycle this value is read by another 
instruction. By definition, a variable can never be live longer than the ii of the 
loop, because the next iteration of the loop will overwrite that value before it 
is read. 


The compiler then proceeds to: 
ii = 6 Did not find schedule 


Sometimes, due to a complex loop or schedule, the compiler simply cannot 
find a valid software pipeline schedule at a particular iteration interval. 


Regs Live Always : 1/5 (A/B-side) 
Max Regs Live : 14/19 
Max Cond Regs Live : 1/0 


- Regs Live Always refers to the number of registers needed for variables
  to be live every cycle in the loop. Data loaded into registers outside the
  loop and read inside the loop will fall into this category.


- Max Regs Live refers to the maximum number of variables live on any one
  cycle in the loop. If there are 33 variables live on one of the cycles inside
  the loop, a minimum of 33 registers is necessary and this will not be pos-
  sible with the 32 registers available on the ’C62x and ’C67x cores. In addi-
  tion, this is broken down between A and B side, so if there is uneven parti-
  tioning with 30 values and there are 17 on one side and 13 on the other,
  the same problem will exist. This situation does not apply to the 64 regis-
  ters available on the ’C64x core.


- Max Cond Regs Live tells us if there are too many conditional values
  needed on a given cycle. The ’C62x and ’C67x cores have 2 A side and
  3 B side condition registers available. The ’C64x core has 3 A side and 3
  B side condition registers available.


After failing at ii = 6, the compiler proceeds to ii = 7: 


ii = 7 Schedule found with 3 iterations in parallel 


It is successful and finds a valid schedule with 3 iterations in parallel. This 
means it is pipelined 3 deep. In other words, before iteration n has completed, 
iterations n+1 and n+2 have begun. 


Each time a particular iteration interval fails, the ii is increased and retried. This 
continues until the ii is equal to the length of a list scheduled loop (no software 
pipelining). This example shows two possible reasons that a loop was not soft- 
ware pipelined. To view the full detail of all possible messages and their de- 
scriptions, see Feedback Solutions in Appendix A. 




After a successful schedule is found at a particular iteration interval, more in- 
formation about the loop is displayed. This information may relate to the load 
threshold, epilog/prolog collapsing, and projected memory bank conflicts. 


Speculative Load Threshold : 12 


When an epilog is removed, the loop is run extra times to finish out the last
iterations, or pipe down the loop. In doing so, extra loads from new iterations
of the loop will speculatively execute (even though their results will never be
used). In order to ensure that these memory accesses are not pointing to
invalid memory locations, the Load Threshold value tells you how many extra
bytes of data beyond your input arrays must be valid memory locations (not
memory-mapped I/O, etc.) to ensure correct execution. In general, in the large
address space of the ’C6000 this is not usually an issue, but you should be
aware of this.


Epilog not entirely removed 
Collapsed epilog stages : 1 


This refers to the number of epilog stages, or loop iterations, that were
removed. This can produce a large savings in code size. The -mh option enables
speculative execution and improves the compiler’s ability to remove epilogs and
prologs. However, in some cases epilogs and prologs can be partially or
entirely removed without speculative execution. Thus, you may see nonzero
values for this even without the -mh option.


Prolog not removed 
Collapsed prolog stages : 0 


This means that the prolog was not removed. For various technical reasons, 
prolog and epilog stages may not be partially or entirely removed. 

Minimum required memory pad : 2 bytes

The minimum required memory padding to use -mh is 2 bytes. See the
TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more informa-
tion on the -mh option and the minimum required memory padding.
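
One way to satisfy a reported memory pad is to reserve the extra bytes directly behind the input array. The sketch below is an illustration only, assuming a short input array: the names padded_count, input_buf, and N_SAMPLES are hypothetical, and the 2-byte pad is the value from the feedback line above. How the buffer is actually placed in memory is up to your linker setup.

```c
#include <assert.h>
#include <stddef.h>

/* Pad size taken from "Minimum required memory pad : 2 bytes". */
#define MEMORY_PAD_BYTES 2u

/* Number of samples the loop really uses (illustrative value). */
#define N_SAMPLES 256

/* Number of elements to reserve so that at least pad_bytes of valid
   memory follow the n elements the loop actually uses. */
size_t padded_count(size_t n, size_t elem_size, size_t pad_bytes)
{
    assert(elem_size > 0);
    return n + (pad_bytes + elem_size - 1) / elem_size;
}

/* Input buffer with the pad appended. The padding is never read for its
   value; it only has to be valid, readable memory. */
short input_buf[N_SAMPLES +
                (MEMORY_PAD_BYTES + sizeof(short) - 1) / sizeof(short)];
```

The padding elements exist only so that speculative loads past the last sample land in ordinary data memory.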


Minimum safe trip count : 2

This means that the loop must execute at least twice to safely use the software
pipelined version of the loop. If the known minimum trip count is less than this
value, two versions of the loop will be generated. For more information on
eliminating redundant loops, see section 2.5.3.2, on page 2-55.


4.2 Loop Disqualification Messages 


4.2.1 Bad Loop Structure


Description


This error is very rare and may stem from the following:

- An asm statement inserted in the C code inner loop.

- Parallel instructions being used as input to the Linear Assembly Optimizer.

- Complex control flow such as GOTO statements, breaks, nested if state-
  ments, if-else statements, and large if statements.


Solution 


Remove any asm statements, complex control flow, or parallel instructions as 
input to linear assembly. 


4.2.2 Loop Contains a Call 


Description 


There are occasions when the compiler may not be able to inline a function call
that is in a loop. Because the compiler was unable to inline the function call,
the loop could not be software pipelined.


Solution 


If the caller and the callee are C or C++, use -pm and -op2. See the
TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more information
on the correct usage of -op2. Do not use -oi0, which disables automatic
inlining.


Add the inline keyword to the callee’s function definition. 
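
As a minimal sketch of the second suggestion, the callee below is defined in the same file as the loop and marked inline. The function names saturate16 and add_sat are hypothetical examples, not routines from this manual, and whether the call is actually inlined still depends on the compiler options in effect.

```c
#include <assert.h>

/* Small callee defined in the same file and marked inline, so that the
   compiler can expand it inside the loop instead of emitting a call. */
static inline int saturate16(int v)
{
    if (v > 32767)
        return 32767;
    if (v < -32768)
        return -32768;
    return v;
}

/* Adds two sample buffers with 16-bit saturation. Once saturate16 is
   expanded in place, the loop body no longer contains a call. */
void add_sat(const short *a, const short *b, short *out, int n)
{
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        out[i] = (short)saturate16(a[i] + b[i]);
}
```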


4.2.3 Too Many Instructions


Oversized loops typically will not schedule because they need too many
registers. They may also increase compilation time. The limit on
the number of instructions is variable.


Solution
Use intrinsics in C code to select more efficient ’C6000 instructions.


Write code in linear assembly to pick the exact ’C6000 instructions to be
executed.


For more information...
See section 2.5.1, Using Intrinsics, on page 2-23.


See Chapter 5, Optimizing Assembly Code via Linear Assembly.


4.2.4 Software Pipelining Disabled


Software pipelining has been disabled by a command-line option. Pipelining will
be turned off when using the -mu option, not using -o2/-o3, or using -ms2/-ms3.


4.2.5 Uninitialized Trip Counter


The trip counter may not have been set to an initial value.


4.2.6 Suppressed to Prevent Code Expansion


Software pipelining may be suppressed because of the -ms1 flag. When the
-ms1 flag is used, software pipelining is disabled in less promising cases to
reduce code size. To enable pipelining, use -ms0 or omit the -ms flag
altogether.


4.2.7 Loop Carried Dependency Bound Too Large


If the loop has complex loop control, try -mh according to the recommenda-
tions in the TMS320C6000 Optimizing C/C++ Compiler User’s Guide.


4.2.8 Cannot Identify Trip Counter


The loop control is too complex. Try to simplify the loop.


4.3 Pipeline Failure Messages 


4.3.1 Address Increment Too Large 


Description 


When software pipelining, the compiler allows reordering of all loads and
stores that use the same array or pointer.
This allows for maximum flexibility in scheduling. Once a schedule is found,
the compiler will return and add the appropriate offsets and increments or
decrements to each load and store. Sometimes, the loads and/or stores end up
being offset too far from each other after reordering (the limit for standard
load pointers is +/- 32). If this happens, the best bet is to restructure the
loop so that the pointers are closer together or rewrite the pointers to use
register offsets that are precomputed.


Solution 


Modify code so that the memory offsets are closer. 


4.3.2 Cannot Allocate Machine Registers 


Description 


After software pipelining and finding a valid schedule, the compiler must allo-
cate all values in the loop to specific machine registers (A0-A15 and B0-B15
for the ’C62x and ’C67x, or A0-A31 and B0-B31 for the ’C64x). There are
occasions when register allocation at this particular ii is not possible,
because the loop schedule found requires more registers than the ’C6000 has
available. The analyzing feedback example shows:


ii = 12 Cannot allocate machine registers 
Regs Live Always : 1/5 (A/B-side) 

Max Regs Live : 14/19 

Max Cond Regs Live : 1/0 


Regs Live Always refers to the number of registers needed for variables live 
every cycle in the loop. Data loaded into registers outside the loop and read 
inside the loop will fall into this category. 


Max Regs Live refers to the maximum number of variables live on any one 
cycle in the loop. If there are 33 variables live on one of the cycles inside the 
loop, a minimum of 33 registers is necessary and this will not be possible with 
the 32 registers available on the C62/C67 cores. 64 registers are available on 
the ’C64x core. In addition, this is broken down between A and B side, so if 
there is uneven partitioning with 30 values and there are 17 on one side and 
13 on the other, the same problem will exist. 




Max Cond Regs Live tells us if there are too many conditional values needed
on a given cycle. The ’C62x/’C67x cores have 2 A side and 3 B side condition
registers available. The ’C64x core has 3 A side and 3 B side condition regis-
ters available.


Solution 


Try splitting the loop into two separate loops. Repartition if there are too
many instructions on one side.


For loops with complex control, try the -mh option.


Use symbolic register names instead of machine registers (A0-A15 and
B0-B15 for ’C62x and ’C67x, or A0-A31 and B0-B31 for ’C64x).
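
Splitting a loop, the first suggestion above, can be sketched in C as follows. The routine names and the particular computation (a gain stage plus an energy sum) are assumptions made up for illustration. The split form reads the input twice, so whether it is a net win depends on the loop; the point is only that each smaller loop keeps fewer values live at once.

```c
#include <assert.h>

/* One loop computing two unrelated results: the values needed by both
   computations are live in the same loop body. */
void scale_and_energy_fused(const short *x, short *y, int gain,
                            int *energy, int n)
{
    int i, e = 0;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        y[i] = (short)((x[i] * gain) >> 15);
        e += x[i] * x[i];
    }
    *energy = e;
}

/* The same results from two smaller loops, each with fewer live values. */
void scale_and_energy_split(const short *x, short *y, int gain,
                            int *energy, int n)
{
    int i, e = 0;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        y[i] = (short)((x[i] * gain) >> 15);

    for (i = 0; i < n; i++)
        e += x[i] * x[i];

    *energy = e;
}
```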


For More Information... 

See section 5.9, Loop Unrolling (in Assembly), on page 5-94. 
See section 2.5.3.4, Loop Unrolling (in C), on page 2-57. 
TMS320C6000 C/C++ Compiler User’s Guide 


4.3.3 Cycle Count Too High. Not Profitable


Description


In rare cases, the iteration interval of a software pipelined loop is higher than
that of a non-pipelined list scheduled loop. In this case, it is more efficient
to execute the non-software pipelined version.


Solution 

Split into multiple loops or reduce the complexity of the loop if possible. 
Unpartition/repartition the linear assembly source code. 

Add const and restrict keywords where appropriate to reduce dependences. 
For loops with complex control, try the -mh option.

Probably best modified by another technique (i.e. loop unrolling). 

Modify the register and/or partition constraints in linear assembly. 

For more information... 

See section 5.9, Loop Unrolling, on page 5-94. 

TMS320C6000 C/C++ Compiler User's Guide 


4.3.4 Did Not Find Schedule 


Description 


Sometimes, due to a complex loop or schedule, the compiler simply cannot 
find a valid software pipeline schedule at a particular iteration interval. 


Solution 

Split into multiple loops or reduce the complexity of the loop if possible. 
Unpartition/repartition the linear assembly source code. 

Probably best modified by another technique (i.e. loop unrolling). 
Modify the register and/or partition constraints in linear assembly. 

For more information... 


See section 5.9, Loop Unrolling, on page 5-94. 


4.3.5 Iterations in Parallel > Max. Trip Count 


Description 


Not all loops can be profitably pipelined. Based on the available information 
on the largest possible trip count, the compiler estimates that it will always be 
more profitable to execute a non-pipelined version than to execute the pipe- 
lined version, given the schedule that it found at the current iteration interval. 


Solution 

Probably best optimized by another technique (i.e. unroll the loop completely). 
For more information... 

See section 5.9, Loop Unrolling (in Assembly), on page 5-94. 

See section 2.5.3.4, Loop Unrolling (in C), on page 2-57. 

See section 2.5.3, Software Pipelining, on page 2-53. 


4.3.6 Speculative Threshold Exceeded 


Description 


It would be necessary to speculatively load beyond the threshold currently
specified by the -mh option.


Solution


Increase the -mh threshold as recommended in the software pipeline feed-
back located in the assembly file.




4.3.7 Iterations in Parallel > Min. Trip Count


Description


Based on the available information on the minimum trip count, it is not always
safe to execute the pipelined version of the loop. Normally, a redundant loop
would be generated. However, in this case, redundant loop generation has
been suppressed via the -ms0/-ms1 option.


Solution 


Add the MUST_ITERATE pragma or the .trip directive to provide more information
on the minimum trip count.


If adding -mh or using a higher value of -mhn could help, try the following
suggestions:


- Use -pm program level optimization to gather more trip count information.


- Use the MUST_ITERATE pragma or the .trip directive to provide minimum
  trip count information.


For more information... 


See section 2.5.3.3, Communicating Trip Count Information to the Compiler, 
on page 2-56. 


See section 5.2.5, The .trip Directive, on page 5-8. 


4.3.8 Register is Live Too Long


Description


Sometimes the compiler finds a valid software pipeline schedule but one or 
more of the values is live too long. Lifetime of a register is determined by the 
cycle a value is written into it and by the last cycle this value is read by another 
instruction. By definition, a variable can never be live longer than the ii of the 
loop, because the next iteration of the loop will overwrite that value before it 
is read. 


After this message, the compiler prints out a detailed description of which
values are live too long:


ii = 11 Register is live too long
   |72| -> |74|
   |73| -> |75|


The numbers 72, 73, 74, and 75 correspond to line numbers and they can be 
mapped back to the offending instructions. 


Solution 


Write linear assembly and insert MV instructions to split register lifetimes that
are live too long.


For more information...


See section 5.10.4.1, Split-Join-Path Problems, on page 5-104.


4.3.9 Too Many Predicates Live on One Side 


Description 


The C6000 has predicate, or conditional, registers available for use with condi-
tional instructions. There are 5 predicate registers on the ’C62x and ’C67x, and
6 predicate registers on the ’C64x: two (’C62x and ’C67x) or three (’C64x) on
the A side and three on the B side. Sometimes the particular partition and
schedule combination requires more than these available registers.


Solution 
Try splitting the loop into two separate loops. 


If multiple conditionals are used in the loop, allocation of these conditionals is
the reason for the failure. Try writing linear assembly and partition all instruc-
tions that write to condition registers evenly between the A and B sides of the
machine. For the ’C62x and ’C67x, if there is an uneven number, put more on
the B side, since there are 3 condition registers on the B side and only 2 on
the A side.


4.3.10 Too Many Reads of One Register 


Description 


The ’C62x, ’C64x, and ’C67x cores can read the same register a maximum of
4 times per cycle. If the schedule found happens to produce code that allows
a single register to be read more than 4 times in a given cycle, the schedule
is invalidated. This code invalidation is not common. If and when it does occur
on the ’C67x, it is possibly due to some floating point instructions that have
multiple cycle reads.




Solution 

Split into multiple loops or reduce the complexity of the loop if possible. 
Unpartition/repartition the linear assembly source code. 

Probably best modified by another technique (i.e. loop unrolling). 
Modify the register and/or partition constraints in linear assembly. 

For more information... 

See section 5.9, Loop Unrolling (in Assembly), on page 5-94. 


See section 2.5.3.4, Loop Unrolling (in C), on page 2-57. 


4.3.11 Trip var. Used in Loop - Can’t Adjust Trip Count


Description


If the loop counter (named trip counter because of the number of trips through 
a loop) is modified within the body of the loop, it typically cannot be converted 
into a downcounting loop (needed for software pipelining on the ’C6000). If 
possible, rewrite the loop to not modify the trip counter by adding a separate 
variable to be modified. 


The fact that the loop counter is used in the loop is actually determined much 
earlier in the loop qualification stage of the compiler. Why did the compiler try 
to schedule this anyway? The reason has to do with the -mh option. This option
allows for extraneous loads and facilitates epilog removal. If the epilog was
successfully removed, the loop counter can sometimes be altered in the loop 
and still allow software pipelining. Sometimes, this isn’t possible after schedul- 
ing and thus the feedback shows up at this stage. 


Solution 


Replicate the trip count variable and use the copy inside the loop so that the 
trip counter and the loop reference separate variables. 


Use the -mh option.
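
The first suggestion, replicating the trip count variable, might look like the following in C. This is a sketch under assumptions: ramp_before and ramp_after are hypothetical names, and the loop body is invented to show a case where the body uses the value of the loop counter. It is not guaranteed that any particular compiler version needs this rewrite.

```c
#include <assert.h>

/* Before: i is both the trip counter and a value used by the body. */
void ramp_before(short *out, int n)
{
    int i;

    assert(n >= 0);
    for (i = n; i > 0; i--)
        out[n - i] = (short)(i * 3);
}

/* After: trip only counts iterations; the body works on a separate
   copy, v, and on its own output pointer. */
void ramp_after(short *out, int n)
{
    int trip;
    int v = n;

    assert(n >= 0);
    for (trip = n; trip > 0; trip--) {
        *out++ = (short)(v * 3);
        v--;
    }
}
```

Both versions write the same values; in the second, the compiler is free to adjust the trip counter without affecting anything the loop body computes.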
For more information... 


See section 2.5.3.6, What Disqualifies a Loop From Being Software Pipelined, 
on page 2-62. 


4.4 Investigative Feedback 


4.4.1 Loop Carried Dependency Bound is Much Larger Than Unpartitioned
Resource Bound


Description 


If the loop carried dependency bound is much larger than the unpartitioned re- 
source bound, this can be an indicator that there is a potential memory alias 
disambiguation problem. This means that there are two pointers that may or 
may not point to the same location, and thus, the compiler must assume they 
might. This can cause a dependency (often between the load of one pointer 
and the store of another) that does not really exist. For software pipelined 
loops, this can greatly degrade performance. 


Solution 
Use -pm program level optimization to reduce memory pointer aliasing.


Add restrict declarations to all pointers passed to a function whose objects do
not overlap.


Use the -mt option to assume no memory pointer aliasing.

Use the .mdep and .no_mdep assembly optimizer directives.

If the loop control is complex, try the -mh option.
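
The restrict suggestion above can be sketched as follows. The two functions are hypothetical examples written for this illustration; they differ only in the restrict qualifiers, which are a promise by the caller that the three buffers never overlap.

```c
#include <assert.h>

/* Without restrict, the compiler must assume that out may overlap a or b,
   so each store to out[i] might feed a later load from a[] or b[]. */
void vec_add(const short *a, const short *b, short *out, int n)
{
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        out[i] = (short)(a[i] + b[i]);
}

/* With restrict, the caller promises the buffers do not overlap, which
   removes the assumed dependency between the stores and the loads. */
void vec_add_restrict(const short * restrict a, const short * restrict b,
                      short * restrict out, int n)
{
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        out[i] = (short)(a[i] + b[i]);
}
```

If the promise is false, for example when out points into a, the behavior of the restrict version is undefined, so add the qualifier only where the buffers really are distinct.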

For More Information... 


See section 5.2, Assembly Optimizer Options and Directives, on page 5-4. 


4.4.2 Two Loops are Generated, One Not Software Pipelined 


Description 


If the trip count is too low, it is illegal to execute the software pipelined version 
of the loop. In this case, the compiler could not guarantee that the minimum 
trip count would be high enough to always safely execute the pipelined ver- 
sion. Hence, it generated a non-pipelined version as well. Code is generated, 
so that at run-time, the appropriate version of the loop will be executed. 


Solution 


Check the software pipeline loop information to see what the compiler knows 
about the trip count. If you have more precise information, provide it to the com- 
piler using one of the following methods: 




- Use the MUST_ITERATE pragma to specify loop count information in C
  code.


- Use the .trip directive to specify loop count information in linear assembly.


Alternatively, the compiler may be able to determine this information on its own
when you compile the function and callers with -pm and -op2.


For More Information... 


See section 2.5.3.3, Communicating Trip Count Information to the Compiler, 
on page 2-56. 


See section 5.2.5, The .trip Directive, on page 5-8. 


4.4.3 Uneven Resources


Description


If the number of resources to do a particular operation is odd, unrolling the loop
is sometimes beneficial. If a loop requires 3 multiplies, then a minimum itera-
tion interval of 2 cycles is required to execute this, because only two
multiplies can issue per cycle (one on each side). If the loop is unrolled
once, the 6 multiplies can be evenly partitioned across the A and B sides,
giving a minimum ii of 3 cycles for two original iterations, or 1.5 cycles per
iteration, which improves performance.


Solution 

Unroll the loop to make an even number of resources. 
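
A loop with three multiplies per iteration, and the same loop unrolled once so that six multiplies are available to balance across the two sides, can be sketched in C like this. The function names and the three-input mix are assumptions for illustration; the unrolled version also assumes that n is even.

```c
#include <assert.h>

/* Three multiplies per iteration: an odd number for two multipliers. */
int mix3(const short *x, const short *y, const short *z,
         int c0, int c1, int c2, int n)
{
    int i, acc = 0;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        acc += x[i] * c0 + y[i] * c1 + z[i] * c2;
    return acc;
}

/* Unrolled once: six multiplies per pass, three for each side.
   Assumes n is even. */
int mix3_unrolled(const short *x, const short *y, const short *z,
                  int c0, int c1, int c2, int n)
{
    int i, acc0 = 0, acc1 = 0;

    assert(n >= 0 && n % 2 == 0);
    for (i = 0; i < n; i += 2) {
        acc0 += x[i] * c0 + y[i] * c1 + z[i] * c2;
        acc1 += x[i + 1] * c0 + y[i + 1] * c1 + z[i + 1] * c2;
    }
    return acc0 + acc1;
}
```

Using two accumulators keeps the two halves of the unrolled body independent of each other, which leaves the scheduler free to place them on different sides.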

For More Information... 

See section 5.9, Loop Unrolling (in Assembly), on page 5-94. 


See section 2.5.3.4, Loop Unrolling (in C), on page 2-57. 


4.4.4 Larger Outer Loop Overhead in Nested Loop


Description


In cases where the inner loop count of a nested loop is relatively small, the time 
to execute the outer loop can start to become a large percentage of the total 
execution time. For cases where this significantly degrades overall loop per- 
formance, unrolling the inner loop may be desired. 


Solution 


Unroll the inner loop. 


Make one loop with the outer loop instructions conditional on an inner loop
counter.
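
The second suggestion can be sketched in C as follows, using row sums of a small matrix as a made-up example. The function names are hypothetical. In the single-loop form, the work that used to belong to the outer loop runs only when the inner counter wraps.

```c
#include <assert.h>

/* Nested form: the outer loop overhead is paid once per row. */
void row_sums_nested(const short *m, int rows, int cols, int *out)
{
    int r, c;

    for (r = 0; r < rows; r++) {
        int sum = 0;
        for (c = 0; c < cols; c++)
            sum += m[r * cols + c];
        out[r] = sum;
    }
}

/* Single loop: the former outer-loop instructions execute conditionally,
   only when the inner counter c reaches cols. */
void row_sums_single(const short *m, int rows, int cols, int *out)
{
    int i, c = 0, sum = 0;

    assert(rows >= 0 && cols > 0);
    for (i = 0; i < rows * cols; i++) {
        sum += m[i];
        c++;
        if (c == cols) {
            *out++ = sum;   /* store the finished row sum */
            sum = 0;        /* reset for the next row */
            c = 0;
        }
    }
}
```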


For More Information 
See Chapter 5, Loop Unrolling (In C) (In Assembly), on page 5-118. 


See section 5.14, Outer Loop Conditionally Executed With Inner Loop, on 
page 5-136. 


4.4.5 There are Memory Bank Conflicts 


Description 


In cases where the compiler generates 2 memory accesses in one cycle and
those accesses are either 8 bytes apart on a ’C620x device, 16 bytes apart
on a ’C670x device, or 32 bytes apart on a ’C640x device, and both accesses
reside within the same memory block, a memory bank stall will occur. Memory
bank conflicts can be completely avoided by either
placing the two accesses in different memory blocks or by writing linear as-
sembly and using the .mptr directive to control memory banks.


Solution 

Write linear assembly and use the .mptr directive 

Link different arrays in separate memory blocks 

For More Information 

See section 5.2.4, The .mptr Directive, on page 5-5. 

See section 5.9, Loop Unrolling (in Assembly), on page 5-94. 
See section 2.5.3.4, Loop Unrolling (in C), on page 2-57. 


See section 5.12, Memory Banks, on page 5-118 


4.4.6 T Address Paths Are Resource Bound


Description


T address paths define the number of memory accesses that must be sent
out on the address bus each loop iteration. If these are the resource bound for
the loop, it is often possible to reduce the number of accesses by performing
word accesses (LDW/STW) for any short accesses being performed.


Solution 




Use word accesses for short arrays; declare int * (or use _nassert) and use
_mpy intrinsics to multiply upper and lower halves of registers


Try to employ redundant load elimination technique if possible 

Use LDW/STW instructions for accesses to memory 

For More Information... 

See section 2.5.2, Using Word Accesses for Short Data (C), on page 2-33. 
See section 5.11, Redundant Load Elimination, on page 5-110. 


See section 5.4, Using Word Access for Short Data (Assembly), on page 5-19. 
Chapter 5


Optimizing Assembly Code 
via Linear Assembly 


This chapter describes methods that help you develop more efficient 
assembly language programs, understand the code produced by the 
assembly optimizer, and perform manual optimization. 


This chapter encompasses phase 3 of the code development flow. After you 
have developed and optimized your C code using the C6000 compiler, extract 
the inefficient areas from your C code and rewrite them in linear assembly (as- 
sembly code that has not been register-allocated and is unscheduled). 


The assembly code shown in this chapter has been hand-optimized in order
to direct your attention to particular coding issues. The actual output from the
assembly optimizer may look different, depending on the version you are using.


5.1 Linear Assembly Code


The source that you write for the assembly optimizer is similar to assembly 
source code; however, linear assembly does not include information about 
parallel instructions, instruction latencies, or register usage. The assembly op- 
timizer takes care of the difficulties of streamlining your code by: 


- Finding instructions that can be executed in parallel
- Handling pipeline latencies during software pipelining
- Assigning register usage
- Defining which unit to use


Although you have the option with the C6000 to specify the functional unit or 
register used, this may restrict the compiler’s ability to fully optimize your code. 
See the TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more in- 
formation. 


This chapter takes you through the optimization process manually to show you 
how the assembly optimizer works and to help you understand when you might 
want to perform some of the optimizations manually. Each section introduces 
optimization techniques in increasing complexity: 


- Section 5.3 and section 5.4 begin with a dot product algorithm to show you
  how to translate the C code to assembly code and then how to optimize
  the linear assembly code with several simple techniques.

- Section 5.5 and section 5.6 introduce techniques for the more complex
  algorithms associated with software pipelining, such as modulo iteration
  interval scheduling for both single-cycle loops and multicycle loops.

- Section 5.7 uses an IIR filter algorithm to discuss the problems with loop
  carry paths.

- Section 5.8 and section 5.9 discuss the problems encountered with
  if-then-else statements in a loop and how loop unrolling can be used to
  resolve them.

- Section 5.10 introduces live-too-long issues in your code.

- Section 5.11 uses a simple FIR filter algorithm to discuss redundant load
  elimination.

- Section 5.12 discusses the same FIR filter in terms of the interleaved
  memory bank scheme used by ’C6000 devices.

- Section 5.13 and section 5.14 show you how to execute the outer loop of
  the FIR filter conditionally and in parallel with the inner loop.




Each example discusses the:

- Algorithm in C code
- Translation of the C code to linear assembly
- Dependency graph to describe the flow of data in the algorithm
- Allocation of resources (functional units, registers, and cross paths) in
  linear assembly


Note:
There are three types of code for the ’C6000: C/C++ code (which is input for
the C/C++ compiler), linear assembly code (which is input for the assembly
optimizer), and assembly code (which is input for the assembler).


In the three sections following section 5.2, we use the dot product to
demonstrate how to use various programming techniques to optimize both
performance and code size. Most of the examples provided in this book use
fixed-point arithmetic; however, the three sections following section 5.2 give
both fixed-point and floating-point examples of the dot product to show that
the same optimization techniques apply to both fixed- and floating-point
programs.




5.2 Assembly Optimizer Options and Directives 


All directives and options that are described in the following sections are listed 
in greater detail in Chapter 4 of the TMS320C6000 Optimizing C/C++ Compil- 
er User’s Guide. 


5.2.1 The -on Option


Software pipelining requires the -o2 or -o3 option. Not specifying -o2 or -o3
facilitates faster compile time and ease of development through reduced
optimization.


5.2.2 The -mt Option and the .no_mdep Directive


Because the assembly optimizer has no idea where objects you are accessing 
are located when you perform load and store instructions, the assembly opti- 
mizer is by default very conservative in determining dependencies between 
memory operations. For example, let us say you have the following loop de- 
fined in a linear assembly file: 


Example 5-1. Linear Assembly Block Copy


loop:
        ldw     *reg1++, reg2
        add     reg2, reg3, reg4
        stw     reg4, *reg5++
 [reg6] add     -1, reg6, reg6
 [reg6] b       loop


The assembly optimizer makes sure that each store through “reg5” completes
before the next load through “reg1”. This results in a suboptimal loop if the
address stored to through “reg5” is not the next location to be read through
“reg1”. For loops where “reg5” points to the next location read through “reg1”,
this ordering is necessary and implies that the loop has a loop carry path.
See section 5.7, Loop Carry Paths, on page 5-77 for more information.


For most loops, this is not the case, and you can tell the assembly optimizer
to be more aggressive about scheduling memory operations. You can do this
either by including the “.no_mdep” (no memory dependencies) directive in
your linear assembly function or by using the -mt option when you compile
the linear assembly file. Be aware that if you are compiling both C code and
linear assembly code in your application, the -mt option has different
meanings for C code and for linear assembly code. In this case, use the
.no_mdep directive in your linear assembly source files.


For a full description of the implications of .no_mdep and the -mt option, refer
to Appendix B, Memory Alias Disambiguation. Refer to the TMS320C6000
Optimizing C/C++ Compiler User’s Guide for more information on both the -mt
option and the .no_mdep directive.
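
The reason for the conservative default can be seen in a C model of the loop in Example 5-1. This is a hedged sketch rather than optimizer output: block_add follows the program order of the loop, while block_add_early_load is a hypothetical model of a more aggressive schedule in which the load for the next iteration is issued before the store of the current one.

```c
#include <stddef.h>

/* Program order: load a word, add a constant, store the result. */
void block_add(const int *src, int *dst, int k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + k;
}

/* Hypothetical aggressive schedule: the load for iteration i + 1 is issued
 * before the store of iteration i completes. */
void block_add_early_load(const int *src, int *dst, int k, size_t n)
{
    if (n == 0)
        return;
    int next = src[0];
    for (size_t i = 0; i < n; i++) {
        int cur = next;
        if (i + 1 < n)
            next = src[i + 1];   /* load for the next iteration first */
        dst[i] = cur + k;        /* then the store for this iteration */
    }
}
```

When src and dst do not overlap, both versions give the same result, so the aggressive schedule is safe. When dst is src + 1, each store feeds the next load: starting from five zeros with k = 1, the program-order version produces 0, 1, 2, 3, 4, while the early-load version produces 0, 1, 1, 1, 1. This is the situation in which .no_mdep or -mt must not be used.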


5.2.3 The .mdep Directive


Should you need to specify a dependence between two or more memory refer- 
ences, use the .mdep directive. Annotate your code with memory reference 
symbols and add the .mdep directive to your linear assembly function. 


Example 5-2. Block Copy With .mdep


        .mdep   ld1, st1
        LDW     *p1++ {ld1}, inp1       ; annotate memory reference ld1
        ; other code ...
        STW     outp2, *p2++ {st1}      ; annotate memory reference st1


The .mdep directive indicates that there is a memory dependence from the LDW
instruction to the STW instruction. This means that the STW instruction must
come after the LDW instruction. The .mdep directive does not imply that there
is a memory dependence from the STW to the LDW. Another .mdep directive
would be needed to handle that case.


5.2.4 The .mptr Directive 


The .mptr directive gives the assembly optimizer information on how to avoid
memory bank conflicts. The assembly optimizer rearranges the memory
references generated in the assembly code to avoid the memory bank conflicts
that were specified with the .mptr directive. Code generated by the assembly
optimizer is therefore faster because it avoids those conflicts.
Example 5-3 shows linear assembly code and the generated loop kernel for
a dot product without the .mptr directive.


Example 5-3. Linear Assembly Dot Product


dotp:   .cproc  ptr_a, ptr_b, cnt
        .reg    val1, val2, val3, val4
        .reg    prod1, prod2, sum1, sum2
        zero    sum1
        zero    sum2
loop:   .trip   20, 20
        ldh     *ptr_a++, val1
        ldh     *ptr_b++, val2
        mpy     val1, val2, prod1
        add     sum1, prod1, sum1
        ldh     *ptr_a++, val3
        ldh     *ptr_b++, val4
        mpy     val3, val4, prod2
        add     sum2, prod2, sum2
 [cnt]  add     -1, cnt, cnt
 [cnt]  b       loop

        add     sum1, sum2, sum1
        .return sum1
        .endproc


<loop kernel generated>


loop:   ; PIPED LOOP KERNEL
   [!A1]  ADD   .L2    B4,B6,B4
||        MPY   .M2X   B7,A0,B6
|| [ B0]  B     .S1    loop
||        LDH   .D2T2  *-B5(2),B6
||        LDH   .D1T1  *-A4(2),A0

   [ A1]  SUB   .S1    A1,1,A1
|| [!A1]  ADD   .L1    A5,A3,A5
||        MPY   .M1X   B6,A0,A3
|| [ B0]  ADD   .L2    -1,B0,B0
||        LDH   .D2T2  *B5++(4),B7
||        LDH   .D1T1  *A4++(4),A0
If the arrays pointed to by ptr_a and ptr_b begin on the same bank, then there
will be memory bank conflicts at every cycle of the loop due to how the LDH 
instructions are paired. 


By adding the .mptr directive information, you can avoid the memory bank con- 
flicts. Example 5—4 shows the linear assembly dot product with the .mptr direc- 
tive and the resulting loop kernel. 




Example 5-4. Linear Assembly Dot Product With .mptr


dotp:   .cproc  ptr_a, ptr_b, cnt
        .reg    val1, val2, val3, val4
        .reg    prod1, prod2, sum1, sum2
        zero    sum1
        zero    sum2
        .mptr   ptr_a, x, 4
        .mptr   ptr_b, x, 4
loop:   .trip   20, 20
        ldh     *ptr_a++, val1
        ldh     *ptr_b++, val2
        mpy     val1, val2, prod1
        add     sum1, prod1, sum1
        ldh     *ptr_a++, val3
        ldh     *ptr_b++, val4
        mpy     val3, val4, prod2
        add     sum2, prod2, sum2
 [cnt]  add     -1, cnt, cnt
 [cnt]  b       loop

        add     sum1, sum2, sum1
        .return sum1
        .endproc


<loop kernel generated>


loop:   ; PIPED LOOP KERNEL
   [!A1]  ADD   .L2    B4,B6,B4
||        MPY   .M2X   B8,A0,B6
|| [ B0]  B     .S1    loop
||        LDH   .D2T2  *B5++(4),B8
||        LDH   .D1T1  *-A4(2),A0

   [ A1]  SUB   .S1    A1,1,A1
|| [!A1]  ADD   .L1    A5,A3,A5
||        MPY   .M1X   B7,A0,A3
|| [ B0]  ADD   .L2    -1,B0,B0
||        LDH   .D2T2  *-B5(2),B7
||        LDH   .D1T1  *A4++(4),A0
The above loop kernel has no memory bank conflicts in the case where ptr_a
and ptr_b point to the same bank. This means that you have to know how your
data is aligned in C code before using the .mptr directive in your linear
assembly code. The C6000 compiler supports pragmas in C/C++ that align your
data to a particular boundary (DATA_ALIGN, for example). Use these pragmas to
align your data properly, so that the .mptr directives work in your linear
assembly code.


5.2.5 The .trip Directive 


The .trip directive is analogous to the MUST_ITERATE pragma for C/C++. The
.trip directive has the form:


label: .trip minimum_value[, maximum_value[, factor]]


For example, to tell the assembly optimizer that the linear assembly loop
executes at least some minimum number of times, use the .trip directive with
just the first parameter. This example tells the assembly optimizer that the
loop will iterate at least ten times.


loop: .trip 10 


You can also tell the assembly optimizer that your loop will execute exactly 
some number of times by setting the minimum_value and maximum_value pa- 
rameters to exactly the same value. This next example tells the assembly opti- 
mizer that the loop will iterate exactly 20 times. 


loop: .trip 20, 20 


The maximum_value parameter can also tell the assembly optimizer that the
loop count lies within a range. The factor parameter tells the assembly
optimizer that the loop count is always a multiple of factor. For example,
the next loop iterates 8, 16, 24, 32, 40, or 48 times each time this
particular linear assembly loop is called.


loop: .trip 8, 48, 8 


The maximum_value and factor parameters are especially useful when your 
loop needs to be interruptible. Refer to section 8.4.4, Getting the Most Perfor- 
mance Out of Interruptible Code. 
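
The meaning of the three .trip parameters can be restated as a predicate on the actual loop count. The function below is a hypothetical helper written for illustration; it is not part of the tools, and it only encodes the description given above.

```c
#include <stdbool.h>

/* Hypothetical helper: true when `count` is consistent with
 * ".trip min, max, factor" as described in the text. */
bool trip_count_ok(unsigned count, unsigned min, unsigned max, unsigned factor)
{
    if (factor == 0)
        return false;   /* a zero factor has no meaning in this model */
    return count >= min && count <= max && count % factor == 0;
}
```

For ".trip 8, 48, 8", a count of 24 satisfies the predicate, while counts of 20 (not a multiple of 8) and 56 (above the maximum) do not. A loop that is called with a count outside what its .trip directive promises may be scheduled incorrectly, so the promise has to hold for every call.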
5.3 Writing Parallel Code


One way to optimize linear assembly code is to reduce the number of execu- 
tion cycles in a loop. You can do this by rewriting linear assembly instructions 
so that the final assembly instructions execute in parallel. 


5.3.1 Dot Product C Code 


The dot product is a sum in which each element in array a is multiplied by the
corresponding element in array b. Each of these products is then accumulated
into sum. The C code in Example 5-5 is a fixed-point dot product algorithm.
The C code in Example 5-6 is a floating-point dot product algorithm.


Example 5—5. Fixed-Point Dot Product C Code 


int dotp(short a[], short b[])
{
    int sum, i;
    sum = 0;

    for(i=0; i<100; i++)
        sum += a[i] * b[i];

    return(sum);
}


Example 5-6. Floating-Point Dot Product C Code 


float dotp(float a[], float b[])
{
    int i;
    float sum;
    sum = 0;
    for(i=0; i<100; i++)
        sum += a[i] * b[i];

    return(sum);
}

5.3.2 Translating C Code to Linear Assembly


The first step in optimizing your code is to translate the C code to linear assem- 
bly. 


5.3.2.1 Fixed-Point Dot Product 


Example 5—7 shows the linear assembly instructions used for the inner loop 
of the fixed-point dot product C code. 


Example 5—7. List of Assembly Instructions for Fixed-Point Dot Product 


        LDH   .D1   *A4++,A2     ; load ai from memory
        LDH   .D1   *A3++,A5     ; load bi from memory
        MPY   .M1   A2,A5,A6     ; ai * bi
        ADD   .L1   A6,A7,A7     ; sum += (ai * bi)
        SUB   .S1   A1,1,A1      ; decrement loop counter
 [A1]   B     .S2   LOOP         ; branch to loop


The load halfword (LDH) instructions increment through the a and b arrays. 
Each LDH does a postincrement on the pointer. Each iteration of these instruc- 
tions sets the pointer to the next halfword (16 bits) in the array. The ADD in- 
struction accumulates the total of the results from the multiply (MPY) instruc- 
tion. The subtract (SUB) instruction decrements the loop counter. 


An additional instruction is included to execute the branch back to the top of 
the loop. The branch (B) instruction is conditional on the loop counter, A1, and 
executes only until A1 is 0. 
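
The instruction list in Example 5-7 can be mirrored statement by statement in C, which makes the translation step easier to follow. This is a minimal sketch: the function name dotp_steps and the cnt parameter are illustrative, and, like the assembly, the code assumes that the loop count is at least 1 on entry.

```c
/* Each statement corresponds to one instruction in Example 5-7. */
int dotp_steps(const short *a, const short *b, int cnt)
{
    int sum = 0;

    do {
        short ai = *a++;      /* LDH  *A4++,A2  : load ai, postincrement */
        short bi = *b++;      /* LDH  *A3++,A5  : load bi, postincrement */
        int   p  = ai * bi;   /* MPY  A2,A5,A6  : ai * bi                */
        sum += p;             /* ADD  A6,A7,A7  : sum += (ai * bi)       */
        cnt -= 1;             /* SUB  A1,1,A1   : decrement loop counter */
    } while (cnt);            /* [A1] B LOOP    : branch while cnt != 0  */

    return sum;
}
```

The do-while form matches the assembly because the branch is tested at the bottom of the loop, after the counter has been decremented.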


5.3.2.2 Floating-Point Dot Product 


Example 5-8 shows the linear assembly instructions used for the inner loop 
of the floating-point dot product C code. 


Example 5—8. List of Assembly Instructions for Floating-Point Dot Product 


        LDW     .D1   *A4++,A2     ; load ai from memory
        LDW     .D1   *A3++,A5     ; load bi from memory
        MPYSP†  .M1   A2,A5,A6     ; ai * bi
        ADDSP†  .L1   A6,A7,A7     ; sum += (ai * bi)
        SUB     .S1   A1,1,A1      ; decrement loop counter
 [A1]   B       .S2   LOOP         ; branch to loop


† ADDSP and MPYSP are ’C67x (floating-point) instructions only.


The load word (LDW) instructions increment through the a and b arrays. Each
LDW does a postincrement on the pointer. Each iteration of these instructions
sets the pointer to the next word (32 bits) in the array. The ADDSP instruction
accumulates the total of the results from the multiply (MPYSP) instruction. The
subtract (SUB) instruction decrements the loop counter.


An additional instruction is included to execute the branch back to the top of 
the loop. The branch (B) instruction is conditional on the loop counter, A1, and 
executes only until A1 is 0. 


5.3.3 Linear Assembly Resource Allocation 


The following rules affect the assignment of functional units for Example 5-7
and Example 5-8 (shown in the third column of each example):


- Load (LDH and LDW) instructions must use a .D unit.
- Multiply (MPY and MPYSP) instructions must use a .M unit.
- Add (ADD and ADDSP) instructions use a .L unit.
- Subtract (SUB) instructions use a .S unit.
- Branch (B) instructions must use a .S unit.


Note:

The ADD and SUB can be on the .S, .L, or .D units; however, for Example 5-7
and Example 5-8, they are assigned as listed above.

The ADDSP instruction in Example 5-8 must use a .L unit.


5.3.4 Drawing a Dependency Graph 


Dependency graphs can help analyze loops by showing the flow of instruc- 
tions and data in an algorithm. These graphs also show how instructions 
depend on one another. The following terms are used in defining a depen- 
dency graph. 


- A node is a point on a dependency graph with one or more data paths
  flowing in and/or out.

- The path shows the flow of data between nodes. The numbers beside
  each path represent the number of cycles required to complete the
  instruction.

- An instruction that writes to a variable is referred to as a parent
  instruction and defines a parent node.

- An instruction that reads a variable written by a parent instruction is
  referred to as its child and defines a child node.


Use the following steps to draw a dependency graph:


1) Define the nodes based on the variables accessed by the instructions. 
2) Define the data paths that show the flow of data between nodes. 

3) Add the instructions and the latencies. 

4) Add the functional units. 


5.3.4.1 Fixed-Point Dot Product 


Figure 5-1 shows the dependency graph for the fixed-point dot product 
assembly instructions shown in Example 5—7 and their corresponding register 
allocations. 


Figure 5-1. Dependency Graph of Fixed-Point Dot Product

[Figure: each node is labeled with the instruction mnemonic, the variable
being written, the register allocation, and the functional unit; each path is
labeled with the number of cycles required to complete the instruction.]

- The two LDH instructions, which write the values of ai and bi, are parents
  of the MPY instruction. It takes five cycles for the parent (LDH) instruction
  to complete. Therefore, if LDH is scheduled on cycle i, then its child (MPY)
  cannot be scheduled until cycle i + 5.

- The MPY instruction, which writes the product pi, is the parent of the ADD
  instruction. The MPY instruction takes two cycles to complete.

- The ADD instruction adds pi (the result of the MPY) to sum. The output of
  the ADD instruction feeds back to become an input on the next iteration
  and, thus, creates a loop carry path. (See section 5.7 on page 5-77 for
  more information on loop carry paths.)


The dependency graph for this dot product algorithm has two separate parts
because the decrement of the loop counter and the branch do not read or write
any variables from the other part.


- The SUB instruction writes to the loop counter, cntr. The output of the SUB
  instruction feeds back and creates a loop carry path.

- The branch (B) instruction is a child of the loop counter.
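
The scheduling constraint that the graph expresses can be written as a small calculation: a child cannot be scheduled before each parent's cycle plus that parent's latency. The sketch below uses the latencies quoted in the text; the function name earliest_child_cycle and the enum constants are illustrative only.

```c
/* Latencies quoted in the text: a child of LDH can be scheduled 5 cycles
 * after it, and a child of MPY 2 cycles after it. */
enum { LDH_LATENCY = 5, MPY_LATENCY = 2 };

/* Earliest cycle on which a child with two parents can be scheduled, given
 * the cycle and latency of each parent. */
int earliest_child_cycle(int parent_a_cycle, int latency_a,
                         int parent_b_cycle, int latency_b)
{
    int ready_a = parent_a_cycle + latency_a;
    int ready_b = parent_b_cycle + latency_b;
    return ready_a > ready_b ? ready_a : ready_b;
}
```

With both LDH instructions on cycle 0, the MPY can be scheduled no earlier than cycle 5, and the ADD that consumes its product no earlier than cycle 7.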


5.3.4.2 Floating-Point Dot Product 


Similarly, Figure 5-2 shows the dependency graph for the floating-point dot 
product assembly instructions shown in Example 5—8 and their corresponding 
register allocations. 


Figure 5-2. Dependency Graph of Floating-Point Dot Product

[Figure: each node is labeled with the instruction mnemonic, the variable
being written, the register allocation, and the functional unit; each path is
labeled with the number of cycles required to complete the instruction.]

- The two LDW instructions, which write the values of ai and bi, are parents
  of the MPYSP instruction. It takes five cycles for the parent (LDW)
  instruction to complete. Therefore, if LDW is scheduled on cycle i, then its
  child (MPYSP) cannot be scheduled until cycle i + 5.

- The MPYSP instruction, which writes the product pi, is the parent of the
  ADDSP instruction. The MPYSP instruction takes four cycles to complete.

- The ADDSP instruction adds pi (the result of the MPYSP) to sum. The
  output of the ADDSP instruction feeds back to become an input on the next
  iteration and, thus, creates a loop carry path. (See section 5.7 on page
  5-77 for more information on loop carry paths.)


The dependency graph for this dot product algorithm has two separate parts
because the decrement of the loop counter and the branch do not read or write
any variables from the other part.


- The SUB instruction writes to the loop counter, cntr. The output of the SUB
  instruction feeds back and creates a loop carry path.

- The branch (B) instruction is a child of the loop counter.


5.3.5 Nonparallel Versus Parallel Assembly Code 


Nonparallel assembly code is performed serially, that is, one instruction follow- 
ing another in sequence. This section explains how to rewrite the instructions 
so that they execute in parallel. 


5.3.5.1 Fixed-Point Dot Product 


Example 5-9 shows the nonparallel assembly code for the fixed-point dot 
product loop. The MVK instruction initializes the loop counter to 100. The 
ZERO instruction clears the accumulator. The NOP instructions allow for the 
delay slots of the LDH, MPY, and B instructions. 


Executing this dot product code serially requires 16 cycles for each iteration 
plus two cycles to set up the loop counter and initialize the accumulator; 100 it- 
erations require 1602 cycles. 


Example 5-9. Nonparallel Assembly Code for Fixed-Point Dot Product 


        MVK   .S1   100, A1      ; set up loop counter
        ZERO  .L1   A7           ; zero out accumulator
LOOP:
        LDH   .D1   *A4++,A2     ; load ai from memory
        LDH   .D1   *A3++,A5     ; load bi from memory
        NOP         4            ; delay slots for LDH
        MPY   .M1   A2,A5,A6     ; ai * bi
        NOP                      ; delay slot for MPY
        ADD   .L1   A6,A7,A7     ; sum += (ai * bi)
        SUB   .S1   A1,1,A1      ; decrement loop counter
 [A1]   B     .S2   LOOP         ; branch to loop
        NOP         5            ; delay slots for branch
        ; Branch occurs here


Assigning the same functional unit to both LDH instructions slows perfor- 
mance of this loop. Therefore, reassign the functional units to execute the 
code in parallel, as shown in the dependency graph in Figure 5-3. The parallel 
assembly code is shown in Example 5—10. 
Figure 5-3. Dependency Graph of Fixed-Point Dot Product with Parallel Assembly

[Figure: dependency graph with the functional units reassigned for parallel
execution; the graphic is not recoverable from the source text.]


Example 5-10. Parallel Assembly Code for Fixed-Point Dot Product 


        MVK   .S1   100, A1      ; set up loop counter
||      ZERO  .L1   A7           ; zero out accumulator
LOOP:
        LDH   .D1   *A4++,A2     ; load ai from memory
||      LDH   .D2   *B4++,B2     ; load bi from memory
        SUB   .S1   A1,1,A1      ; decrement loop counter
 [A1]   B     .S2   LOOP         ; branch to loop
        NOP         2            ; delay slots for LDH
        MPY   .M1X  A2,B2,A6     ; ai * bi
        NOP                      ; delay slots for MPY
        ADD   .L1   A6,A7,A7     ; sum += (ai * bi)
        ; Branch occurs here


Because the loads of ai and bi do not depend on one another, both LDH
instructions can execute in parallel as long as they do not share the same
resources. To schedule the load instructions in parallel, allocate the
functional units as follows:


- ai and the pointer to ai to a functional unit on the A side, .D1
- bi and the pointer to bi to a functional unit on the B side, .D2


Because the MPY instruction now has one source operand from A and one
from B, MPY uses the 1X cross path.


Rearranging the order of the instructions also improves the performance of the
code. The SUB instruction can take the place of one of the NOP delay slots 
for the LDH instructions. Moving the B instruction after the SUB removes the 
need for the NOP 5 used at the end of the code in Example 5-9. 


The branch now occurs immediately after the ADD instruction so that the MPY 
and ADD execute in parallel with the five delay slots required by the branch 
instruction. 


5.3.5.2 Floating-Point Dot Product 


Similarly, Example 5-11 shows the nonparallel assembly code for the
floating-point dot product loop. The MVK instruction initializes the loop
counter to 100. The ZERO instruction clears the accumulator. The NOP
instructions allow for the delay slots of the LDW, ADDSP, MPYSP, and B
instructions.


Executing this dot product code serially requires 21 cycles for each iteration 
plus two cycles to set up the loop counter and initialize the accumulator; 100 it- 
erations require 2102 cycles. 


Example 5-11. Nonparallel Assembly Code for Floating-Point Dot Product 


        MVK    .S1   100, A1     ; set up loop counter
        ZERO   .L1   A7          ; zero out accumulator
LOOP:
        LDW    .D1   *A4++,A2    ; load ai from memory
        LDW    .D1   *A3++,A5    ; load bi from memory
        NOP          4           ; delay slots for LDW
        MPYSP  .M1   A2,A5,A6    ; ai * bi
        NOP          3           ; delay slots for MPYSP
        ADDSP  .L1   A6,A7,A7    ; sum += (ai * bi)
        NOP          3           ; delay slots for ADDSP
        SUB    .S1   A1,1,A1     ; decrement loop counter
 [A1]   B      .S2   LOOP        ; branch to loop
        NOP          5           ; delay slots for branch
        ; Branch occurs here


Assigning the same functional unit to both LDW instructions slows perfor- 
mance of this loop. Therefore, reassign the functional units to execute the 
code in parallel, as shown in the dependency graph in Figure 5—4. The parallel 
assembly code is shown in Example 5-12. 
Figure 5-4. Dependency Graph of Floating-Point Dot Product with Parallel Assembly

[Figure: dependency graph with the functional units reassigned for parallel
execution; the graphic is not recoverable from the source text.]


Example 5—12. Parallel Assembly Code for Floating-Point Dot Product 


        MVK    .S1   100, A1     ; set up loop counter
||      ZERO   .L1   A7          ; zero out accumulator
LOOP:
        LDW    .D1   *A4++,A2    ; load ai from memory
||      LDW    .D2   *B4++,B2    ; load bi from memory
        SUB    .S1   A1,1,A1     ; decrement loop counter
        NOP          2           ; delay slots for LDW
 [A1]   B      .S2   LOOP        ; branch to loop
        MPYSP  .M1X  A2,B2,A6    ; ai * bi
        NOP          3           ; delay slots for MPYSP
        ADDSP  .L1   A6,A7,A7    ; sum += (ai * bi)
        ; Branch occurs here


Because the loads of ai and bi do not depend on one another, both LDW
instructions can execute in parallel as long as they do not share the same
resources. To schedule the load instructions in parallel, allocate the
functional units as follows:


- ai and the pointer to ai to a functional unit on the A side, .D1
- bi and the pointer to bi to a functional unit on the B side, .D2


Because the MPYSP instruction now has one source operand from A and one
from B, MPYSP uses the 1X cross path.


Rearranging the order of the instructions also improves the performance of the
code. The SUB instruction replaces one of the NOP delay slots for the LDW 
instructions. Moving the B instruction after the SUB removes the need for the 
NOP 5 used at the end of the code in Example 5—11 on page 5-16. 


The branch now occurs immediately after the ADDSP instruction so that the 
MPYSP and ADDSP execute in parallel with the five delay slots required by 
the branch instruction. 


Since the ADDSP finishes execution before the result is needed, the NOP 3
for delay slots is removed, further reducing cycle count.


5.3.6 Comparing Performance


Executing the fixed-point dot product code in Example 5—10 requires eight 
cycles for each iteration plus one cycle to set up the loop counter and initialize 
the accumulator; 100 iterations require 801 cycles. 


Table 5—1 compares the performance of the nonparallel code with the parallel 
code for the fixed-point example. 


Table 5-1. Comparison of Nonparallel and Parallel Assembly Code for Fixed-Point
Dot Product


Code Example                                                 100 Iterations   Cycle Count
Example 5-9    Fixed-point dot product nonparallel assembly  2 + 100 x 16     1602
Example 5-10   Fixed-point dot product parallel assembly     1 + 100 x 8      801


Executing the floating-point dot product code in Example 5—12 requires ten 
cycles for each iteration plus one cycle to set up the loop counter and initialize 
the accumulator; 100 iterations require 1001 cycles. 


Table 5-2 compares the performance of the nonparallel code with the parallel 
code for the floating-point example. 


Table 5-2. Comparison of Nonparallel and Parallel Assembly Code for Floating-Point
Dot Product


Code Example                                                    100 Iterations   Cycle Count
Example 5-11   Floating-point dot product nonparallel assembly  2 + 100 x 21     2102
Example 5-12   Floating-point dot product parallel assembly     1 + 100 x 10     1001
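
The cycle counts in Table 5-1 and Table 5-2 all follow the same arithmetic: setup cycles plus the number of iterations times the cycles per iteration. A minimal sketch, using the illustrative function name total_cycles:

```c
/* Total cycles = setup cycles + iterations * cycles per iteration. */
unsigned total_cycles(unsigned setup, unsigned iterations, unsigned per_iter)
{
    return setup + iterations * per_iter;
}
```

For example, total_cycles(2, 100, 16) gives the 1602 cycles of the nonparallel fixed-point code, and total_cycles(1, 100, 8) gives the 801 cycles of the parallel version. The setup cost drops from two cycles to one because MVK and ZERO execute in parallel.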
5.4 Using Word Access for Short Data and Doubleword Access for Floating-Point Data


The parallel code for the fixed-point example in section 5.3 uses an LDH
instruction to read a[i]. Because a[i] and a[i+1] are next to each other in
memory, you can optimize the code further by using the load word (LDW)
instruction to read a[i] and a[i+1] at the same time and load both into a single
32-bit register. (The data must be word-aligned in memory.)


In the floating-point example, the parallel code uses an LDW instruction to read
a[i]. Because a[i] and a[i+1] are next to each other in memory, you can
optimize the code further by using the load doubleword (LDDW) instruction to
read a[i] and a[i+1] at the same time and load both into a register pair. (The
data must be doubleword-aligned in memory.) See the TMS320C6000 CPU and
Instruction Set User’s Guide for more specific information on the LDDW
instruction.


Note:

The load doubleword (LDDW) instruction is available on the ’C64x
(fixed-point) and ’C67x (floating-point) devices.


5.4.1 Unrolled Dot Product C Code


The fixed-point C code in Example 5-13 has the effect of unrolling the loop by
accumulating the even elements, a[i] and b[i], into sum0 and the odd elements,
a[i+1] and b[i+1], into sum1. After the loop, sum0 and sum1 are added to
produce the final sum. The same is true for the floating-point C code in
Example 5-14. (For another example of loop unrolling, see section 5.9 on
page 5-94.)


Example 5—13. Fixed-Point Dot Product C Code (Unrolled) 


int dotp(short a[], short b[])
{
    int sum0, sum1, sum, i;

    sum0 = 0;
    sum1 = 0;
    for (i=0; i<100; i+=2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    sum = sum0 + sum1;
    return(sum);
}

Example 5-14. Floating-Point Dot Product C Code (Unrolled)


float dotp(float a[], float b[])
{
    int i;
    float sum0, sum1, sum;
    sum0 = 0;
    sum1 = 0;
    for (i=0; i<100; i+=2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    sum = sum0 + sum1;
    return(sum);
}
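
As a quick check that the unrolled form computes the same fixed-point result as the original loop, the two can be compared directly. This is a sketch with an added length parameter n, which is an assumption made for testing; the manual's versions use a fixed count of 100. The names dotp_rolled and dotp_unrolled are illustrative.

```c
/* Original form: one product accumulated per iteration. */
int dotp_rolled(const short *a, const short *b, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled by two: n is assumed to be even, as with the count of 100. */
int dotp_unrolled(const short *a, const short *b, int n)
{
    int sum0 = 0, sum1 = 0;
    for (int i = 0; i < n; i += 2) {
        sum0 += a[i] * b[i];            /* even elements */
        sum1 += a[i + 1] * b[i + 1];    /* odd elements  */
    }
    return sum0 + sum1;
}
```

For integer data the two results are identical. For the floating-point version, splitting the accumulation into two partial sums changes the order of the additions, so the result can differ from the original loop in the least significant bits.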


5.4.2 Translating C Code to Linear Assembly 


The first step in optimizing your code is to translate the C code to linear assem- 
bly. 


5.4.2.1 Fixed-Point Dot Product 


Example 5—15 shows the list of C6000 instructions that execute the unrolled 
fixed-point dot product loop. Symbolic variable names are used instead of ac- 
tual registers. Using symbolic names for data and pointers makes code easier 
to write and allows the optimizer to allocate registers. However, you must use 
the .reg assembly optimizer directive. See the TMS320C6000 Optimizing 
C/C++ Compiler User’s Guide for more information on writing linear assembly 
code. 


Example 5-15. Linear Assembly for Fixed-Point Dot Product Inner Loop with LDW 


        LDW     *a++,ai_i1        ; load ai & ai+1 from memory
        LDW     *b++,bi_i1        ; load bi & bi+1 from memory
        MPY     ai_i1,bi_i1,pi    ; ai * bi
        MPYH    ai_i1,bi_i1,pi1   ; ai+1 * bi+1
        ADD     pi,sum0,sum0      ; sum0 += (ai * bi)
        ADD     pi1,sum1,sum1     ; sum1 += (ai+1 * bi+1)
 [cntr] SUB     cntr,1,cntr       ; decrement loop counter
 [cntr] B       LOOP              ; branch to loop


The two load word (LDW) instructions load a[i], a[i+1], b[i], and b[i+1] on each
iteration.


Two MPY instructions are now necessary to multiply the second set of array
elements:


- The first MPY instruction multiplies the 16 least significant bits (LSBs) in
  each source register: a[i] x b[i].

- The MPYH instruction multiplies the 16 most significant bits (MSBs) of
  each source register: a[i+1] x b[i+1].


The two ADD instructions accumulate the sums of the even and odd elements:
sum0 and sum1.


Note:

This is true only when the ’C6x is in little-endian mode. In big-endian mode,
MPY operates on a[i+1] and b[i+1] and MPYH operates on a[i] and b[i]. See
the TMS320C6000 Peripherals Reference Guide for more information.
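
The pairing of MPY with the low halfwords and MPYH with the high halfwords can be modeled in C. The sketch below assumes little-endian mode, as in the note above. The names pack_le, mpy_model, mpyh_model, and dotp_packed are hypothetical helpers written for this illustration; they are not compiler intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of v. */
static int32_t sext16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Pack two 16-bit elements the way a little-endian word load sees them:
   the element at the lower address lands in the low half of the register. */
uint32_t pack_le(int16_t lo_elem, int16_t hi_elem)
{
    return (uint32_t)(uint16_t)lo_elem | ((uint32_t)(uint16_t)hi_elem << 16);
}

/* Model of MPY: signed multiply of the 16 LSBs of each source. */
int32_t mpy_model(uint32_t src1, uint32_t src2)
{
    return sext16(src1) * sext16(src2);
}

/* Model of MPYH: signed multiply of the 16 MSBs of each source. */
int32_t mpyh_model(uint32_t src1, uint32_t src2)
{
    return sext16(src1 >> 16) * sext16(src2 >> 16);
}

/* Dot product built from the two models; pack_le stands in for LDW. */
int32_t dotp_packed(const int16_t a[], const int16_t b[], int n)
{
    assert(n % 2 == 0);
    int32_t sum0 = 0, sum1 = 0;
    for (int i = 0; i < n; i += 2) {
        uint32_t aw = pack_le(a[i], a[i + 1]);
        uint32_t bw = pack_le(b[i], b[i + 1]);
        sum0 += mpy_model(aw, bw);    /* a[i]   * b[i]   */
        sum1 += mpyh_model(aw, bw);   /* a[i+1] * b[i+1] */
    }
    return sum0 + sum1;
}
```

In big-endian mode the two halves of each packed word swap roles, which is why the note restricts the MPY/MPYH assignment to little-endian operation.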


5.4.2.2 Floating-Point Dot Product 


Example 5-16 shows the list of ’C6x instructions that execute the unrolled 
floating-point dot product loop. Symbolic variable names are used instead of 
actual registers. Using symbolic names for data and pointers makes code eas- 
ier to write and allows the optimizer to allocate registers. However, you must 
use the .reg assembly optimizer directive. See the TMS320C6000 Optimizing 
C/C++ Compiler User’s Guide for more information on writing linear assembly 
code. 


Example 5-16. Linear Assembly for Floating-Point Dot Product Inner Loop with LDDW


        LDDW    *a++,ai1:ai0        ; load a[i+0] & a[i+1] from memory
        LDDW    *b++,bi1:bi0        ; load b[i+0] & b[i+1] from memory
        MPYSP   ai0,bi0,pi0         ; a[i+0] * b[i+0]
        MPYSP   ai1,bi1,pi1         ; a[i+1] * b[i+1]
        ADDSP   pi0,sum0,sum0       ; sum0 += (a[i+0] * b[i+0])
        ADDSP   pi1,sum1,sum1       ; sum1 += (a[i+1] * b[i+1])
[cntr]  SUB     cntr,1,cntr         ; decrement loop counter
[cntr]  B       LOOP                ; branch to loop


The two load doubleword (LDDW) instructions load a[i], a[i+1], b[i], and b[i+1]
on each iteration.


Two MPYSP instructions are now necessary to multiply the second set of array 
elements. 


The two ADDSP instructions accumulate the sums of the even and odd 
elements: sum0 and sum1. 
5.4.3 Drawing a Dependency Graph


The dependency graph in Figure 5-5 for the fixed-point dot product shows that
the LDW instructions are parents of the MPY instructions and the MPY
instructions are parents of the ADD instructions. To split the graph between
the A and B register files, place an equal number of LDWs, MPYs, and ADDs on
each side. To keep both sides even, place the remaining two instructions, B and
SUB, on opposite sides.


Figure 5-5. Dependency Graph of Fixed-Point Dot Product With LDW

[Figure: dependency graph divided into an A side and a B side, with one LDW at the top of each side.]


Similarly, the dependency graph in Figure 5-6 for the floating-point dot
product shows that the LDDW instructions are parents of the MPYSP instructions
and the MPYSP instructions are parents of the ADDSP instructions. To split
the graph between the A and B register files, place an equal number of
LDDWs, MPYSPs, and ADDSPs on each side. To keep both sides even, place
the remaining two instructions, B and SUB, on opposite sides.


Figure 5-6. Dependency Graph of Floating-Point Dot Product With LDDW

[Figure: dependency graph divided into an A side and a B side, with one LDDW at the top of each side.]


5.4.4 Linear Assembly Resource Allocation 


After splitting the dependency graph for both the fixed-point and floating-point
dot products, you can assign functional units and registers, as shown in the
dependency graphs in Figure 5-7 and Figure 5-8 and in the instructions in
Example 5-17 and Example 5-18. The .M1X and .M2X represent a path in the
dependency graph crossing from one side to the other.
Figure 5-7. Dependency Graph of Fixed-Point Dot Product With LDW (Showing
Functional Units)

[Figure: dependency graph with an A side and a B side; the LDW nodes are assigned to .D1 and .D2, and the load edges are labeled 5.]


Example 5-17. Linear Assembly for Fixed-Point Dot Product Inner Loop With LDW
(With Allocated Resources)


        LDW     .D1     *A4++,A2        ; load ai and ai+1 from memory
        LDW     .D2     *B4++,B2        ; load bi and bi+1 from memory
        MPY     .M1X    A2,B2,A6        ; ai * bi
        MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
        ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
        SUB     .S1     A1,1,A1         ; decrement loop counter
  [A1]  B       .S2     LOOP            ; branch to loop
Figure 5-8. Dependency Graph of Floating-Point Dot Product With LDDW (Showing
Functional Units)

[Figure: dependency graph with an A side and a B side; the LDDW nodes are assigned to .D1 and .D2, and the load edges are labeled 5.]


Example 5-18. Linear Assembly for Floating-Point Dot Product Inner Loop With LDDW
(With Allocated Resources)


        LDDW    .D1     *A4++,A3:A2     ; load ai and ai+1 from memory
        LDDW    .D2     *B4++,B3:B2     ; load bi and bi+1 from memory
        MPYSP   .M1X    A2,B2,A6        ; ai * bi
        MPYSP   .M2X    A3,B3,B6        ; ai+1 * bi+1
        ADDSP   .L1     A6,A7,A7        ; sum0 += (ai * bi)
        ADDSP   .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
        SUB     .S1     A1,1,A1         ; decrement loop counter
  [A1]  B       .S2     LOOP            ; branch to loop
5.4.5 Final Assembly


Example 5-19 shows the final assembly code for the unrolled loop of the
fixed-point dot product and Example 5-20 shows the final assembly code for the
unrolled loop of the floating-point dot product.


5.4.5.1 Fixed-Point Dot Product 


Example 5-19 uses LDW instructions instead of LDH instructions. 


Example 5-19. Assembly Code for Fixed-Point Dot Product With LDW
(Before Software Pipelining)


        MVK     .S1     50,A1           ; set up loop counter
||      ZERO    .L1     A7              ; zero out sum0 accumulator
||      ZERO    .L2     B7              ; zero out sum1 accumulator

LOOP:
        LDW     .D1     *A4++,A2        ; load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ; load bi & bi+1 from memory

        SUB     .S1     A1,1,A1         ; decrement loop counter

  [A1]  B       .S2     LOOP            ; branch to loop

        NOP     2

        MPY     .M1X    A2,B2,A6        ; ai * bi
||      MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1

        NOP

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
                                        ; Branch occurs here

        ADD     .L1X    A7,B7,A4        ; sum = sum0 + sum1


The code in Example 5-19 includes the following optimizations: 


- The setup code for the loop is included to initialize the array pointers and
  the loop counter and to clear the accumulators. The setup code assumes
  that A4 and B4 have been initialized to point to arrays a and b, respectively.

- The MVK instruction initializes the loop counter.

- The two ZERO instructions, which execute in parallel, initialize the even
  and odd accumulators (sum0 and sum1) to 0.

- The third ADD instruction adds the even and odd accumulators.


5.4.5.2 Floating-Point Dot Product


Example 5-20 uses LDDW instructions instead of LDW instructions. 


Example 5-20. Assembly Code for Floating-Point Dot Product With LDDW
(Before Software Pipelining)


        MVK     .S1     50,A1           ; set up loop counter
||      ZERO    .L1     A7              ; zero out sum0 accumulator
||      ZERO    .L2     B7              ; zero out sum1 accumulator

LOOP:
        LDDW    .D1     *A4++,A3:A2     ; load ai & ai+1 from memory
||      LDDW    .D2     *B4++,B3:B2     ; load bi & bi+1 from memory

        SUB     .S1     A1,1,A1         ; decrement loop counter

        NOP     2

  [A1]  B       .S2     LOOP            ; branch to loop

        MPYSP   .M1X    A2,B2,A6        ; ai * bi
||      MPYSP   .M2X    A3,B3,B6        ; ai+1 * bi+1

        NOP     3

        ADDSP   .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADDSP   .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
                                        ; Branch occurs here

        NOP     3

        ADDSP   .L1X    A7,B7,A4        ; sum = sum0 + sum1

        NOP     3


The code in Example 5-20 includes the following optimizations: 


- The setup code for the loop is included to initialize the array pointers and
  the loop counter and to clear the accumulators. The setup code assumes
  that A4 and B4 have been initialized to point to arrays a and b, respectively.

- The MVK instruction initializes the loop counter.

- The two ZERO instructions, which execute in parallel, initialize the even
  and odd accumulators (sum0 and sum1) to 0.

- The third ADDSP instruction adds the even and odd accumulators.


5.4.6 Comparing Performance


Executing the fixed-point dot product with the optimizations in Example 5-19
requires only 50 iterations, because you operate in parallel on both the even
and odd array elements. With the setup code and the final ADD instruction, the
work of the original 100 iterations requires a total of 402 cycles
(1 + 8 x 50 + 1).


Table 5-3 compares the performance of the different versions of the fixed- 
point dot product code discussed so far. 


Table 5-3. Comparison of Fixed-Point Dot Product Code With Use of LDW


Code Example                                                       100 Iterations     Cycle Count

Example 5-9   Fixed-point dot product nonparallel assembly         2 + 100 x 16       1602
Example 5-10  Fixed-point dot product parallel assembly            1 + 100 x 8        801
Example 5-19  Fixed-point dot product parallel assembly with LDW   1 + (50 x 8) + 1   402


Executing the floating-point dot product with the optimizations in
Example 5-20 requires only 50 iterations, because you operate in parallel on
both the even and odd array elements. With the setup code and the final
ADDSP instruction, the work of the original 100 iterations requires a total of
508 cycles (1 + 10 x 50 + 7).


Table 5-4 compares the performance of the different versions of the
floating-point dot product code discussed so far.


Table 5-4. Comparison of Floating-Point Dot Product Code With Use of LDDW


Code Example                                                           100 Iterations      Cycle Count

Example 5-11  Floating-point dot product nonparallel assembly          2 + 100 x 21        2102
Example 5-12  Floating-point dot product parallel assembly             1 + 100 x 10        1001
Example 5-20  Floating-point dot product parallel assembly with LDDW   1 + (50 x 10) + 7   508
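

The cycle counts in Table 5-3 and Table 5-4 all follow one pattern: setup cycles, plus loop passes times cycles per pass, plus any cycles after the loop. The sketch below restates that arithmetic; the function name total_cycles and its parameter names are assumptions made for this illustration.

```c
#include <assert.h>

/* Total cycles = setup + passes * cycles_per_pass + tail.
   "tail" covers instructions that run once after the loop, such as the
   final ADD, or the final ADDSP together with its NOPs. */
int total_cycles(int setup, int passes, int cycles_per_pass, int tail)
{
    assert(passes >= 0 && cycles_per_pass > 0);
    return setup + passes * cycles_per_pass + tail;
}
```

For example, total_cycles(1, 50, 8, 1) gives the 402 cycles listed for Example 5-19, and total_cycles(1, 50, 10, 7) gives the 508 cycles listed for Example 5-20.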
5.5 Software Pipelining


This section describes the process for improving the performance of the as- 
sembly code in the previous section through software pipelining. 


Software pipelining is a technique used to schedule instructions from a loop 
so that multiple iterations execute in parallel. The parallel resources on the 
’C6x make it possible to initiate a new loop iteration before previous iterations
finish. The goal of software pipelining is to start a new loop iteration as soon 
as possible. 


The modulo iteration interval scheduling table is introduced in this section as 
an aid to creating software-pipelined loops. 


The fixed-point dot product code in Example 5-19 needs eight cycles for each 
iteration of the loop: five cycles for the LDWs, two cycles for the MPYs, and one 
cycle for the ADDs. 


Figure 5-9 shows the dependency graph for the fixed-point dot product
instructions. Example 5-21 shows the same dot product assembly code in
Example 5-17 on page 5-24, except that the SUB instruction is now conditional
on the loop counter (A1).


Note:

Making the SUB instruction conditional on A1 ensures that A1 stops
decrementing when it reaches 0. Otherwise, as the loop executes five more
times, the loop counter becomes a negative number. When A1 is negative, it is
nonzero and, therefore, causes the condition on the branch to be true again. If
the SUB instruction were not conditional on A1, you would have an infinite loop.

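
The effect described in the note can be reproduced with a small simulation. The sketch below is a deliberately simplified model, not a cycle-accurate description of the ’C6x pipeline: it assumes a single-cycle loop in which one [A1] B and one SUB issue every cycle, that both read A1 at the start of the cycle, and that a branch takes effect five cycles after it issues. The function name cycles_until_exit and its parameters are hypothetical.

```c
#include <assert.h>

#define BRANCH_DELAY 5   /* delay slots of the B instruction */

/* Returns the number of cycles until the loop has been left for good,
   or -1 if it is still running after max_cycles (treated as "infinite"). */
int cycles_until_exit(int count, int conditional_sub, int max_cycles)
{
    assert(count > 0 && max_cycles > BRANCH_DELAY);
    int taken[BRANCH_DELAY + 1] = {0};   /* branch decisions still in flight */
    int a1 = count;                      /* loop counter register A1 */
    int in_loop = 1;

    for (int cycle = 0; cycle < max_cycles; cycle++) {
        /* [A1] B: the condition is evaluated when the branch issues. */
        taken[cycle % (BRANCH_DELAY + 1)] = in_loop && (a1 != 0);

        /* SUB: either conditional on A1 or unconditional. */
        if (in_loop && (!conditional_sub || a1 != 0))
            a1--;

        /* The branch issued BRANCH_DELAY cycles ago lands now. */
        if (cycle >= BRANCH_DELAY)
            in_loop = taken[(cycle - BRANCH_DELAY) % (BRANCH_DELAY + 1)];

        /* The loop is finished only when no taken branch is still pending. */
        if (!in_loop) {
            int pending = 0;
            for (int k = 0; k < BRANCH_DELAY; k++)
                pending |= taken[(cycle - k) % (BRANCH_DELAY + 1)];
            if (!pending)
                return cycle + 1;
        }
    }
    return -1;
}
```

With a conditional SUB the counter stays at 0, every later branch is not taken, and the loop ends. With an unconditional SUB the counter passes through 0 and goes negative, later branches are taken again, and the model never leaves the loop within the cycle limit.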
The floating-point dot product code in Example 5-20 needs ten cycles for each
iteration of the loop: five cycles for the LDDWs, four cycles for the MPYSPs,
and one cycle for the ADDSPs.


Figure 5-10 shows the dependency graph for the floating-point dot product
instructions. Example 5-22 shows the same dot product assembly code in
Example 5-18 on page 5-25, except that the SUB instruction is now conditional
on the loop counter (A1).


Note:

The ADDSP has 3 delay slots associated with it. The extra delay slots are
taken up by the LDDW, SUB, and NOP when executing the next iteration of the
loop. Thus a NOP 3 is not required inside the loop but is required outside
the loop prior to adding sum0 and sum1 together.
Figure 5-9. Dependency Graph of Fixed-Point Dot Product With LDW
(Showing Functional Units)


Example 5-21. Linear Assembly for Fixed-Point Dot Product Inner Loop
(With Conditional SUB Instruction)


        LDW     .D1     *A4++,A2        ; load ai and ai+1 from memory
        LDW     .D2     *B4++,B2        ; load bi and bi+1 from memory
        MPY     .M1X    A2,B2,A6        ; ai * bi
        MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
        ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
  [A1]  SUB     .S1     A1,1,A1         ; decrement loop counter
  [A1]  B       .S2     LOOP            ; branch to top of loop
Figure 5-10. Dependency Graph of Floating-Point Dot Product With LDDW
(Showing Functional Units)


Example 5-22. Linear Assembly for Floating-Point Dot Product Inner Loop
(With Conditional SUB Instruction)


        LDDW    .D1     *A4++,A3:A2     ; load ai and ai+1 from memory
        LDDW    .D2     *B4++,B3:B2     ; load bi and bi+1 from memory
        MPYSP   .M1X    A2,B2,A6        ; ai * bi
        MPYSP   .M2X    A3,B3,B6        ; ai+1 * bi+1
        ADDSP   .L1     A6,A7,A7        ; sum0 += (ai * bi)
        ADDSP   .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
  [A1]  SUB     .S1     A1,1,A1         ; decrement loop counter
  [A1]  B       .S2     LOOP            ; branch to top of loop
5.5.1 Modulo Iteration Interval Scheduling


Another way to represent the performance of the code is by looking at it in a
modulo iteration interval scheduling table. This table shows how a
software-pipelined loop executes and tracks the available resources on a
cycle-by-cycle basis to ensure that no resource is used twice on any given
cycle. The iteration interval of a loop is the number of cycles between the
initiations of successive iterations of that loop.


5.5.1.1 Fixed-Point Example 


The fixed-point code in Example 5-19 needs eight cycles for each iteration of
the loop, so the iteration interval is eight. 


Table 5-5 shows a modulo iteration interval scheduling table for the fixed-point
dot product loop before software pipelining (Example 5-19). Each row
represents a functional unit. There is a column for each cycle in the loop
showing the instruction that is executing on a particular cycle:

- LDWs on the .D units are issued on cycles 0, 8, 16, 24, etc.

- MPY and MPYH on the .M units are issued on cycles 5, 13, 21, 29, etc.

- ADDs on the .L units are issued on cycles 7, 15, 23, 31, etc.

- SUB on the .S1 unit is issued on cycles 1, 9, 17, 25, etc.

- B on the .S2 unit is issued on cycles 2, 10, 18, 26, etc.

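Table 5-5. Modulo Iteration Interval Scheduling Table for Fixed-Point Dot Product
(Before Software Pipelining)


Unit / Cycle | 0, 8, ...  | 1, 9, ...  | 2, 10, ... | 3, 11, ... | 4, 12, ... | 5, 13, ... | 6, 14, ... | 7, 15, ... |
.D1          | LDW        |            |            |            |            |            |            |            |
.D2          | LDW        |            |            |            |            |            |            |            |
.M1          |            |            |            |            |            | MPY        |            |            |
.M2          |            |            |            |            |            | MPYH       |            |            |
.L1          |            |            |            |            |            |            |            | ADD        |
.L2          |            |            |            |            |            |            |            | ADD        |
.S1          |            | SUB        |            |            |            |            |            |            |
.S2          |            |            | B          |            |            |            |            |            |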


In this example, each unit is used only once every eight cycles. 
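
The rule that no resource is used twice on any given cycle can be checked mechanically: two instructions conflict when they use the same unit and their issue cycles are equal modulo the iteration interval. The sketch below illustrates that check. The Slot type and the function modulo_schedule_ok are names invented for this example, and cross paths are not modeled.

```c
#include <assert.h>
#include <string.h>

/* One instruction in the loop: the unit it uses and the cycle, counted
   from the start of its own iteration, on which it is issued. */
typedef struct {
    const char *unit;
    int issue_cycle;
} Slot;

/* Returns 1 if no unit is claimed twice in the same column
   (issue_cycle % ii) of the modulo scheduling table, 0 otherwise. */
int modulo_schedule_ok(const Slot s[], int n, int ii)
{
    assert(ii > 0 && n >= 0);
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            if (strcmp(s[i].unit, s[j].unit) == 0 &&
                s[i].issue_cycle % ii == s[j].issue_cycle % ii)
                return 0;
    return 1;
}
```

Applied to the issue cycles listed above (LDW on 0, SUB on 1, B on 2, MPY and MPYH on 5, ADD on 7), the check passes for an iteration interval of 8. It also passes for an interval of 1, because every instruction uses a different unit.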
5.5.1.2 Floating-Point Example


The floating-point code in Example 5-20 needs ten cycles for each iteration 
of the loop, so the iteration interval is ten. 


Table 5-6 shows a modulo iteration interval scheduling table for the floating- 
point dot product loop before software pipelining (Example 5-20). Each row 
represents a functional unit. There is a column for each cycle in the loop show- 
ing the instruction that is executing on a particular cycle: 


- LDDWs on the .D units are issued on cycles 0, 10, 20, 30, etc.

- MPYSPs on the .M units are issued on cycles 5, 15, 25, 35, etc.

- ADDSPs on the .L units are issued on cycles 9, 19, 29, 39, etc.

- SUB on the .S1 unit is issued on cycles 3, 13, 23, 33, etc.

- B on the .S2 unit is issued on cycles 4, 14, 24, 34, etc.


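Table 5-6. Modulo Iteration Interval Scheduling Table for Floating-Point Dot Product
(Before Software Pipelining)


Unit / Cycle | 0, 10, ... | 1, 11, ... | 2, 12, ... | 3, 13, ... | 4, 14, ... | 5, 15, ... | 6, 16, ... | 7, 17, ... | 8, 18, ... | 9, 19, ... |
.D1          | LDDW       |            |            |            |            |            |            |            |            |            |
.D2          | LDDW       |            |            |            |            |            |            |            |            |            |
.M1          |            |            |            |            |            | MPYSP      |            |            |            |            |
.M2          |            |            |            |            |            | MPYSP      |            |            |            |            |
.L1          |            |            |            |            |            |            |            |            |            | ADDSP      |
.L2          |            |            |            |            |            |            |            |            |            | ADDSP      |
.S1          |            |            |            | SUB        |            |            |            |            |            |            |
.S2          |            |            |            |            | B          |            |            |            |            |            |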


In this example, each unit is used only once every ten cycles. 
5.5.1.3 Determining the Minimum Iteration Interval


Software pipelining increases performance by using the resources more effi- 
ciently. However, to create a fully pipelined schedule, it is helpful to first deter- 
mine the minimum iteration interval. 


The minimum iteration interval of a loop is the minimum number of cycles you 
must wait between each initiation of successive iterations of that loop. The 
smaller the iteration interval, the fewer cycles it takes to execute a loop. 


Resources and data dependency constraints determine the minimum iteration 
interval. The most-used resource constrains the minimum iteration interval. 
For example, if four instructions in a loop all use the .S1 unit, the minimum it- 
eration interval is at least 4. Four instructions using the same resource cannot 
execute in parallel and, therefore, require at least four separate cycles to 
execute each instruction. 


With the SUB and branch instructions on opposite sides of the dependency
graph in Figure 5-9 and Figure 5-10, all eight instructions use a different
functional unit and no two instructions use the same cross paths (1X and 2X).
Because no two instructions use the same resource, the minimum iteration
interval based on resources is 1.
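
The resource bound described above can be computed by counting how many instructions share each unit and taking the largest count. The following sketch shows only that resource-based bound; it assumes each instruction is already assigned to one unit, and it ignores cross paths and data dependencies. The function name resource_min_ii is an assumption for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Resource-constrained lower bound on the iteration interval: the largest
   number of instructions that compete for any single functional unit.
   units[i] names the unit used by instruction i. */
int resource_min_ii(const char *const units[], int n)
{
    assert(n > 0);
    int best = 0;
    for (int i = 0; i < n; i++) {
        int count = 0;
        for (int j = 0; j < n; j++)
            if (strcmp(units[i], units[j]) == 0)
                count++;
        if (count > best)
            best = count;
    }
    return best;
}
```

For the eight dot product instructions, each on its own unit, the bound is 1. For four instructions that all use .S1, the bound is 4, matching the example given earlier in this section.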


Note:

In this particular example, there are no data dependencies to affect the
minimum iteration interval. However, future examples may demonstrate this
constraint.


5.5.1.4 Creating a Fully Pipelined Schedule 


Having determined that the minimum iteration interval is 1, you can initiate a 
new iteration every cycle. You can schedule LDW (or LDDW) and MPY (or 
MPYSP) instructions on every cycle. 


Fixed-Point Example 


Table 5-7 shows a fully pipelined schedule for the fixed-point dot product ex- 
ample. 


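Table 5-7. Modulo Iteration Interval Table for Fixed-Point Dot Product
(After Software Pipelining)

Loop prolog: cycles 0 through 6. Single-cycle loop: cycles 7, 8, 9, ...

Unit / Cycle | 0          | 1          | 2          | 3          | 4          | 5          | 6          | 7, 8, 9... |
.D1          | LDW        | LDW*       | LDW**      | LDW***     | LDW****    | LDW*****   | LDW******  | LDW******* |
.D2          | LDW        | LDW*       | LDW**      | LDW***     | LDW****    | LDW*****   | LDW******  | LDW******* |
.M1          |            |            |            |            |            | MPY        | MPY*       | MPY**      |
.M2          |            |            |            |            |            | MPYH       | MPYH*      | MPYH**     |
.L1          |            |            |            |            |            |            |            | ADD        |
.L2          |            |            |            |            |            |            |            | ADD        |
.S1          |            | SUB        | SUB*       | SUB**      | SUB***     | SUB****    | SUB*****   | SUB******  |
.S2          |            |            | B          | B*         | B**        | B***       | B****      | B*****     |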


Note: The asterisks indicate the iteration of the loop; shading indicates the single-cycle loop. 


The rightmost column in Table 5-7 is a single-cycle loop that contains the
entire loop. Cycles 0-6 are loop setup code, or loop prolog.


Asterisks define which iteration of the loop the instruction is executing each 
cycle. For example, the rightmost column shows that on any given cycle inside 
the loop: 


- The ADD instructions are adding data for iteration n.

- The MPY instructions are multiplying data for iteration n + 2 (**).

- The LDW instructions are loading data for iteration n + 7 (*******).

- The SUB instruction is executing for iteration n + 6 (******).

- The B instruction is executing for iteration n + 5 (*****).


In this case, multiple iterations of the loop execute in parallel in a software pipe- 
line that is eight iterations deep, with iterations n through n + 7 executing in par- 
allel. Fixed-point software pipelines are rarely deeper than the one created by 
this single-cycle loop. As loop sizes grow, the number of iterations that can 
execute in parallel tends to become fewer. 
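
The asterisk counts in the table follow directly from the issue cycle of each instruction. The sketch below illustrates the calculation under the assumption that an instruction's pipeline stage is its issue cycle divided by the iteration interval, with cycle 0 starting stage 0. The function names iterations_ahead and pipeline_depth are hypothetical.

```c
#include <assert.h>

/* Number of iterations by which an instruction issued on cycle `issue`
   runs ahead of the instruction issued on cycle `last` (the final
   instruction of an iteration), for iteration interval ii. */
int iterations_ahead(int issue, int last, int ii)
{
    assert(ii > 0 && issue >= 0 && issue <= last);
    return last / ii - issue / ii;
}

/* Number of iterations in flight at once in the pipelined loop. */
int pipeline_depth(int last, int ii)
{
    assert(ii > 0 && last >= 0);
    return last / ii + 1;
}
```

With an iteration interval of 1 and the ADD on cycle 7, the LDW on cycle 0 is 7 iterations ahead, the SUB on cycle 1 is 6 ahead, the B on cycle 2 is 5 ahead, and the MPY on cycle 5 is 2 ahead, which gives the eight-deep pipeline described above.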
Floating-Point Example


Table 5-8 shows a fully pipelined schedule for the floating-point dot product 
example. 


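Table 5-8. Modulo Iteration Interval Table for Floating-Point Dot Product
(After Software Pipelining)

Loop prolog: cycles 0 through 8. Single-cycle loop: cycles 9, 10, 11, ...

Unit / Cycle | 0             | 1             | 2             | 3             | 4             | 5             | 6             | 7             | 8             | 9, 10, 11...  |
.D1          | LDDW          | LDDW*         | LDDW**        | LDDW***       | LDDW****      | LDDW*****     | LDDW******    | LDDW*******   | LDDW********  | LDDW********* |
.D2          | LDDW          | LDDW*         | LDDW**        | LDDW***       | LDDW****      | LDDW*****     | LDDW******    | LDDW*******   | LDDW********  | LDDW********* |
.M1          |               |               |               |               |               | MPYSP         | MPYSP*        | MPYSP**       | MPYSP***      | MPYSP****     |
.M2          |               |               |               |               |               | MPYSP         | MPYSP*        | MPYSP**       | MPYSP***      | MPYSP****     |
.L1          |               |               |               |               |               |               |               |               |               | ADDSP         |
.L2          |               |               |               |               |               |               |               |               |               | ADDSP         |
.S1          |               |               |               | SUB           | SUB*          | SUB**         | SUB***        | SUB****       | SUB*****      | SUB******     |
.S2          |               |               |               |               | B             | B*            | B**           | B***          | B****         | B*****        |
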
Note: The asterisks indicate the iteration of the loop; shading indicates the single-cycle loop. 


The rightmost column in Table 5-8 is a single-cycle loop that contains the 
entire loop. Cycles 0-8 are loop setup code, or loop prolog. 


Asterisks define which iteration of the loop the instruction is executing each 
cycle. For example, the rightmost column shows that on any given cycle inside 
the loop: 


- The ADDSP instructions are adding data for iteration n.

- The MPYSP instructions are multiplying data for iteration n + 4 (****).

- The LDDW instructions are loading data for iteration n + 9 (*********).

- The SUB instruction is executing for iteration n + 6 (******).

- The B instruction is executing for iteration n + 5 (*****).


Note:

Since the ADDSP instruction has three delay slots associated with it, the
results of adding are staggered by four. That is, the first result from the
ADDSP is added to the fifth result, which is then added to the ninth, and so
on. The second result is added to the sixth, which is then added to the 10th.
This is shown in Table 5-9.


In this case, multiple iterations of the loop execute in parallel in a software pipe- 
line that is ten iterations deep, with iterations n through n + 9 executing in paral- 
lel. Floating-point software pipelines are rarely deeper than the one created 
by this single-cycle loop. As loop sizes grow, the number of iterations that can 
execute in parallel tends to become fewer. 


5.5.1.5 Staggered Accumulation With a Multicycle Instruction 


When accumulating results with an instruction that is multicycle (that is, has 
delay slots other than 0), you must either unroll the loop or stagger the results. 
When unrolling the loop, multiple accumulators collect the results so that one 
result has finished executing and has been written into the accumulator before 
adding the next result of the accumulator. If you do not unroll the loop, then the 
accumulator will contain staggered results. 


Staggered results occur when you attempt to accumulate successive results 
while in the delay slots of previous execution. This can be achieved without 
error if you are aware of what is in the accumulator, what will be added to that 
accumulator, and when the results will be written on a given cycle (such as the 
pseudo-code shown in Example 5-23). 


Example 5-23. Pseudo-Code for Single-Cycle Accumulator With ADDSP


LOOP:           ADDSP   x,sum,sum
||              LDW     *xptr++,x
||      [cond]  B       LOOP
||      [cond]  SUB     cond,1,cond


Table 5-9 shows the results of the loop kernel for a single-cycle accumulator 
using a multicycle add instruction; in this case, the ADDSP, which has three 
delay slots (a 4-cycle instruction). 
Table 5-9. Software Pipeline Accumulation Staggered Results Due to Three-Cycle
Delay


Cycle #   Pseudoinstruction         Current value of pseudoregister sum     Written expected result

0         ADDSP x(0), sum, sum      0                                       ; cycle 4   sum = x(0)
1         ADDSP x(1), sum, sum      0                                       ; cycle 5   sum = x(1)
2         ADDSP x(2), sum, sum      0                                       ; cycle 6   sum = x(2)
3         ADDSP x(3), sum, sum      0                                       ; cycle 7   sum = x(3)
4         ADDSP x(4), sum, sum      x(0)                                    ; cycle 8   sum = x(0) + x(4)
5         ADDSP x(5), sum, sum      x(1)                                    ; cycle 9   sum = x(1) + x(5)
6         ADDSP x(6), sum, sum      x(2)                                    ; cycle 10  sum = x(2) + x(6)
7         ADDSP x(7), sum, sum      x(3)                                    ; cycle 11  sum = x(3) + x(7)
8         ADDSP x(8), sum, sum      x(0) + x(4)                             ; cycle 12  sum = x(0) + x(4) + x(8)

i + j †   ADDSP x(i+j), sum, sum    x(j) + x(j+4) + x(j+8) ... x(i-4+j)     ; cycle i+j+4  sum = x(j) + x(j+4) + x(j+8) ... x(i-4+j) + x(i+j)

† where i is a multiple of 4


The first value of the array x, x(0) is added to the accumulator (sum) on cycle 
0, but the result is not ready until cycle 4. This means that on cycle 1 when x(1) 
is added to the accumulator (sum), sum has no value in it from x(0). Thus, 
when this result is ready on cycle 5, sum will have the value x(1) in it, instead 
of the value x(0) + x(1). When you reach cycle 4, sum will have the value x(0) 
in it and the value x(4) will be added to that, causing sum = x(0) + x(4) on 
cycle 8. This is continuously repeated, resulting in four separate accumula- 
tions (using the register “sum”). 


The current value in the accumulator “sum” depends on which iteration is be- 
ing done. After the completion of the loop, the last four sums should be written 
into separate registers and then added together to give the final result. This 
is shown in Example 5-27 on page 5-43. 
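

The staggering can be reproduced in C by modeling the write-back delay explicitly. The sketch below assumes one ADDSP issues per cycle, reads the value of sum visible on its issue cycle, and writes its result four cycles later, as in Table 5-9. The names ADDSP_LATENCY, staggered_accumulate, and staggered_total are assumptions introduced for this illustration.

```c
#include <assert.h>

#define ADDSP_LATENCY 4   /* three delay slots: result lands 4 cycles after issue */

/* Simulate `ADDSP x(cycle), sum, sum` issued on every cycle.
   On return, partial[k] holds the k-th of the four interleaved sums. */
void staggered_accumulate(const float x[], int n, float partial[ADDSP_LATENCY])
{
    assert(n >= ADDSP_LATENCY && n % ADDSP_LATENCY == 0);
    float in_flight[ADDSP_LATENCY] = {0};  /* results not yet written back */
    float sum = 0.0f;                      /* the pseudoregister sum */

    for (int cycle = 0; cycle < n + ADDSP_LATENCY; cycle++) {
        int slot = cycle % ADDSP_LATENCY;
        if (cycle >= ADDSP_LATENCY) {
            sum = in_flight[slot];         /* write-back from 4 cycles ago */
            partial[slot] = sum;
        }
        if (cycle < n)
            in_flight[slot] = x[cycle] + sum;  /* issue reads the current sum */
    }
}

/* Combine the four partial sums after the loop, pairwise. */
float staggered_total(const float x[], int n)
{
    float p[ADDSP_LATENCY];
    staggered_accumulate(x, n, p);
    return (p[0] + p[1]) + (p[2] + p[3]);
}
```

For eight inputs x(0) through x(7), the four partial sums are x(0) + x(4), x(1) + x(5), x(2) + x(6), and x(3) + x(7), which is the pattern shown in the table for cycles 8 through 11.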
5.5.2 Using the Assembly Optimizer to Create Optimized Loops


Example 5-24 shows the linear assembly code for the full fixed-point dot
product loop. Example 5-25 shows the linear assembly code for the full
floating-point dot product loop. You can use this code as input to the assembly
optimizer tool to create software-pipelined loops automatically. See the
TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more information
on the assembly optimizer.


Example 5-24. Linear Assembly for Full Fixed-Point Dot Product


        .global _dotp
_dotp:  .cproc  a, b
        .reg    sum, sum0, sum1, cntr
        .reg    ai_i1, bi_i1, pi, pi1

        MVK     50,cntr             ; cntr = 100/2
        ZERO    sum0                ; multiply result = 0
        ZERO    sum1                ; multiply result = 0

LOOP:   .trip   50
        LDW     *a++,ai_i1          ; load ai & ai+1 from memory
        LDW     *b++,bi_i1          ; load bi & bi+1 from memory
        MPY     ai_i1,bi_i1,pi      ; ai * bi
        MPYH    ai_i1,bi_i1,pi1     ; ai+1 * bi+1
        ADD     pi,sum0,sum0        ; sum0 += (ai * bi)
        ADD     pi1,sum1,sum1       ; sum1 += (ai+1 * bi+1)
[cntr]  SUB     cntr,1,cntr         ; decrement loop counter
[cntr]  B       LOOP                ; branch to loop

        ADD     sum0,sum1,sum       ; compute final result
        .return sum
        .endproc


Resources such as functional units and 1X and 2X cross paths do not have 
to be specified because these can be allocated automatically by the assembly 
optimizer. 
Example 5-25. Linear Assembly for Full Floating-Point Dot Product


        .global _dotp
_dotp:  .cproc  a, b
        .reg    sum, sum0, sum1, cntr
        .reg    ai1:ai0, bi1:bi0, pi0, pi1

        MVK     50,cntr             ; cntr = 100/2
        ZERO    sum0                ; multiply result = 0
        ZERO    sum1                ; multiply result = 0

LOOP:   .trip   50
        LDDW    *a++,ai1:ai0        ; load ai & ai+1 from memory
        LDDW    *b++,bi1:bi0        ; load bi & bi+1 from memory
        MPYSP   ai0,bi0,pi0         ; ai * bi
        MPYSP   ai1,bi1,pi1         ; ai+1 * bi+1
        ADDSP   pi0,sum0,sum0       ; sum0 += (ai * bi)
        ADDSP   pi1,sum1,sum1       ; sum1 += (ai+1 * bi+1)
[cntr]  SUB     cntr,1,cntr         ; decrement loop counter
[cntr]  B       LOOP                ; branch to loop

        ADDSP   sum0,sum1,sum       ; compute final result
        .return sum
        .endproc


5.5.3 Final Assembly 


Example 5-26 shows the assembly code for the fixed-point software-pipelined
dot product in Table 5-7 on page 5-35. Example 5-27 shows the assembly
code for the floating-point software-pipelined dot product in Table 5-8 on
page 5-36. The accumulators are initialized to 0 and the loop counter is set up
in the first execute packet in parallel with the first load instructions. The
asterisks in the comments correspond with those in Table 5-7 and Table 5-8,
respectively.


Note:

All instructions executing in parallel constitute an execute packet. An
execute packet can contain up to eight instructions.

See the TMS320C6000 CPU and Instruction Set Reference Guide for more
information about pipeline operation.
5.5.3.1 Fixed-Point Example


Multiple branch instructions are in the pipe. The first branch in the fixed-point 
dot product is issued on cycle 2 but does not actually branch until the end of 
cycle 7 (after five delay slots). The branch target is the execute packet defined 
by the label LOOP. On cycle 7, the first branch returns to the same execute 
packet, resulting in a single-cycle loop. On every cycle after cycle 7, a branch 
executes back to LOOP until the loop counter finally decrements to 0. Once 
the loop counter is 0, five more branches execute because they are already 
in the pipe. 


Executing the dot product code with the software pipelining as shown in
Example 5-26 requires a total of 58 cycles (7 + 50 + 1), which is a significant
improvement over the 402 cycles required by the code in Example 5-19.


Note:

The code created by the assembly optimizer will not completely match the
final assembly code shown in this and future sections because different
versions of the tool will produce slightly different code. However, the inner
loop performance (number of cycles per iteration) should be similar.
Example 5-26. Assembly Code for Fixed-Point Dot Product (Software Pipelined)


        LDW     .D1     *A4++,A2        ; load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ; load bi & bi+1 from memory
||      MVK     .S1     50,A1           ; set up loop counter
||      ZERO    .L1     A7              ; zero out sum0 accumulator
||      ZERO    .L2     B7              ; zero out sum1 accumulator

  [A1]  SUB     .S1     A1,1,A1         ; decrement loop counter
||      LDW     .D1     *A4++,A2        ;* load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;* load bi & bi+1 from memory

  [A1]  SUB     .S1     A1,1,A1         ;* decrement loop counter
||[A1]  B       .S2     LOOP            ; branch to loop
||      LDW     .D1     *A4++,A2        ;** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;** load bi & bi+1 from memory

  [A1]  SUB     .S1     A1,1,A1         ;** decrement loop counter
||[A1]  B       .S2     LOOP            ;* branch to loop
||      LDW     .D1     *A4++,A2        ;*** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;*** load bi & bi+1 from memory

  [A1]  SUB     .S1     A1,1,A1         ;*** decrement loop counter
||[A1]  B       .S2     LOOP            ;** branch to loop
||      LDW     .D1     *A4++,A2        ;**** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;**** load bi & bi+1 from memory

        MPY     .M1X    A2,B2,A6        ; ai * bi
||      MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1
||[A1]  SUB     .S1     A1,1,A1         ;**** decrement loop counter
||[A1]  B       .S2     LOOP            ;*** branch to loop
||      LDW     .D1     *A4++,A2        ;***** ld ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;***** ld bi & bi+1 from memory

        MPY     .M1X    A2,B2,A6        ;* ai * bi
||      MPYH    .M2X    A2,B2,B6        ;* ai+1 * bi+1
||[A1]  SUB     .S1     A1,1,A1         ;***** decrement loop counter
||[A1]  B       .S2     LOOP            ;**** branch to loop
||      LDW     .D1     *A4++,A2        ;****** ld ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;****** ld bi & bi+1 from memory

LOOP:
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1
||[A1]  SUB     .S1     A1,1,A1         ;****** decrement loop counter
||[A1]  B       .S2     LOOP            ;***** branch to loop
||      LDW     .D1     *A4++,A2        ;******* ld ai & ai+1 fm memory
||      LDW     .D2     *B4++,B2        ;******* ld bi & bi+1 fm memory
                                        ; Branch occurs here

        ADD     .L1X    A7,B7,A4        ; sum = sum0 + sum1
5.5.3.2 Floating-Point Example


The first branch in the floating-point dot product is issued on cycle 4 but does 
not actually branch until the end of cycle 9 (after five delay slots). The branch 
target is the execute packet defined by the label LOOP. On cycle 9, the first 
branch returns to the same execute packet, resulting in a single-cycle loop. On 
every cycle after cycle 9, a branch executes back to LOOP until the loop count- 
er finally decrements to 0. Once the loop counter is 0, five more branches 
execute because they are already in the pipe. 


Executing the floating-point dot product code with the software pipelining as
shown in Example 5-27 requires a total of 74 cycles (9 + 50 + 15), which is a
significant improvement over the 508 cycles required by the code in
Example 5-20.


Example 5-27. Assembly Code for Floating-Point Dot Product (Software Pipelined)


        MVK     .S1     50,A1           ; set up loop counter
||      ZERO    .L1     A8              ; sum0 = 0
||      ZERO    .L2     B8              ; sum1 = 0
||      LDDW    .D1     *A4++,A7:A6     ; load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ; load bi & bi + 1 from memory

        LDDW    .D1     *A4++,A7:A6     ;* load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;* load bi & bi + 1 from memory

        LDDW    .D1     *A4++,A7:A6     ;** load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;** load bi & bi + 1 from memory

        LDDW    .D1     *A4++,A7:A6     ;*** load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;*** load bi & bi + 1 from memory
||[A1]  SUB     .S1     A1,1,A1         ; decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;**** load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;**** load bi & bi + 1 from memory
||[A1]  B       .S2     LOOP            ; branch to loop
||[A1]  SUB     .S1     A1,1,A1         ;* decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;***** load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;***** load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ; pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 * b1
||[A1]  B       .S2     LOOP            ;* branch to loop
||[A1]  SUB     .S1     A1,1,A1         ;** decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;****** load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;****** load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;* pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;* pi1 = a1 * b1
||[A1]  B       .S2     LOOP            ;** branch to loop
||[A1]  SUB     .S1     A1,1,A1         ;*** decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;******* load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;******* load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;** pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;** pi1 = a1 * b1
||[A1]  B       .S2     LOOP            ;*** branch to loop
||[A1]  SUB     .S1     A1,1,A1         ;**** decrement loop counter

        LDDW    .D1     *A4++,A7:A6     ;******** load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;******** load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;*** pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;*** pi1 = a1 * b1
||[A1]  B       .S2     LOOP            ;**** branch to loop
||[A1]  SUB     .S1     A1,1,A1         ;***** decrement loop counter

LOOP:
        LDDW    .D1     *A4++,A7:A6     ;********* load ai & ai + 1 from memory
||      LDDW    .D2     *B4++,B7:B6     ;********* load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;**** pi = a0 * b0
||      MPYSP   .M2X    A7,B7,B5        ;**** pi1 = a1 * b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai * bi)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 * bi+1)
||[A1]  B       .S2     LOOP            ;***** branch to loop
||[A1]  SUB     .S1     A1,1,A1         ;****** decrement loop counter
                                        ; Branch occurs here

        ADDSP   .L1X    A8,B8,A0        ; sum(0) = sum0(0) + sum1(0)
        ADDSP   .L2X    A8,B8,B0        ; sum(1) = sum0(1) + sum1(1)
        ADDSP   .L1X    A8,B8,A0        ; sum(2) = sum0(2) + sum1(2)
        ADDSP   .L2X    A8,B8,B0        ; sum(3) = sum0(3) + sum1(3)
        NOP                             ; wait for B0
        ADDSP   .L1X    A0,B0,A5        ; sum(01) = sum(0) + sum(1)
        NOP                             ; wait for next B0
        ADDSP   .L2X    A0,B0,B5        ; sum(23) = sum(2) + sum(3)
        NOP     3
        ADDSP   .L1X    A5,B5,A4        ; sum = sum(01) + sum(23)
        NOP     3
5.5.3.3 Removing Extraneous Instructions


The code in Example 5-26 and Example 5-27 executes extra iterations of
some of the instructions in the loop. The following operations occur in parallel
on the last cycle of the loop in Example 5-26:

- Iteration 50 of the ADD instructions
- Iteration 52 of the MPY and MPYH instructions
- Iteration 57 of the LDW instructions

The following operations occur in parallel on the last cycle of the loop in
Example 5-27:

- Iteration 50 of the ADDSP instructions
- Iteration 54 of the MPYSP instructions
- Iteration 59 of the LDDW instructions


In most cases, extra iterations are not a problem; however, when extraneous 
LDWs and LDDWs access unmapped memory, you can get unpredictable re- 
sults. If the extraneous instructions present a potential problem, remove the 
extraneous load and multiply instructions by adding an epilog like that included 
in the second part of Example 5-28 on page 5-47 and Example 5-29 on 
page 5-48. 
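
The iteration numbers listed above follow from how far each instruction trails the load of the same iteration in the pipelined schedule. The following is a minimal sketch in C, assuming issue offsets read off the examples (in the fixed-point loop the MPY issues 5 cycles and the ADD 7 cycles after the LDW; in the floating-point loop the MPYSP issues 5 cycles and the ADDSP 9 cycles after the LDDW). The function names are hypothetical and are not part of any TI tool.

```c
/* Hypothetical helpers for a single-cycle software-pipelined loop.
 * Offsets are issue cycles measured from the load of the same iteration:
 *   final_offset - offset of the last-stage instruction (ADD or ADDSP)
 *   instr_offset - offset of the instruction of interest */

/* Number of extra iterations an instruction runs ahead of the final stage. */
int extraneous_iterations(int final_offset, int instr_offset)
{
    return final_offset - instr_offset;
}

/* Iteration an instruction is working on during the last loop cycle,
 * when the final stage is completing iteration trip_count. */
int iteration_on_last_cycle(int trip_count, int final_offset, int instr_offset)
{
    return trip_count + extraneous_iterations(final_offset, instr_offset);
}
```

With a trip count of 50, the fixed-point offsets give iteration 52 for the multiplies and 57 for the loads; the floating-point offsets give 54 and 59.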


Fixed-Point Example 


To eliminate LDWs in the fixed-point dot product from iterations 51 through 57,
run the loop seven fewer times. This brings the loop counter to 43 (50 - 7),
which means you still must execute seven more cycles of ADD instructions
and five more cycles of MPY instructions. Five pairs of MPYs and seven pairs
of ADDs are now outside the loop. The LDWs, MPYs, and ADDs all execute
exactly 50 times. (The shaded areas of Example 5-28 indicate the changes
in this code.)


Executing the dot product code in Example 5-28 with no extraneous LDWs
still requires a total of 58 cycles (7 + 43 + 7 + 1), but the code size is now
larger.


Floating-Point Example 


To eliminate LDDWs in the floating-point dot product from iterations 51 through
59, run the loop nine fewer times. This brings the loop counter to 41 (50 - 9),
which means you still must execute nine more cycles of ADDSP instructions
and five more cycles of MPYSP instructions. Five pairs of MPYSPs and nine
pairs of ADDSPs are now outside the loop. The LDDWs, MPYSPs, and
ADDSPs all execute exactly 50 times. (The shaded areas of Example 5-29
indicate the changes in this code.)


Executing the dot product code in Example 5-29 with no extraneous LDDWs
still requires a total of 74 cycles (9 + 41 + 9 + 15), but the code size is now
larger.
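
The bookkeeping in the two examples above can be sketched in C as follows. This is an illustration only: the struct and function names are hypothetical, and the offsets (multiply 5 cycles after the load; add 7 cycles after the load for fixed point, 9 for floating point) are assumptions read off the schedules in this section.

```c
/* Hypothetical planning helper: how to split a single-cycle pipelined
 * loop into a shortened loop plus an epilog with no extraneous loads. */
typedef struct {
    int loop_count;   /* value to load into the loop counter */
    int mul_epilog;   /* epilog cycles that still issue multiplies */
    int add_epilog;   /* epilog cycles that still issue adds */
} EpilogPlan;

/* trip_count: iterations required of every instruction.
 * mul_offset, add_offset: issue cycles of the multiply and the add,
 * measured from the load of the same iteration. */
EpilogPlan plan_epilog(int trip_count, int mul_offset, int add_offset)
{
    EpilogPlan plan;

    /* The loads finish first, so the loop runs add_offset fewer times. */
    plan.loop_count = trip_count - add_offset;
    /* After the last load, each later stage still has as many
     * iterations outstanding as it lags behind the load. */
    plan.mul_epilog = mul_offset;
    plan.add_epilog = add_offset;
    return plan;
}
```

For 50 iterations this yields a loop count of 43 with five multiply and seven add cycles in the fixed-point epilog, and a loop count of 41 with five multiply and nine add cycles in the floating-point epilog.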


Example 5-28. Assembly Code for Fixed-Point Dot Product (Software Pipelined
With No Extraneous Loads)

        LDW     .D1     *A4++,A2        ; load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ; load bi & bi+1 from memory
||      MVK     .S1     43,A1           ; set up loop counter
||      ZERO    .L1     A7              ; zero out sum0 accumulator
||      ZERO    .L2     B7              ; zero out sum1 accumulator

   [A1] SUB     .S1     A1,1,A1         ; decrement loop counter
||      LDW     .D1     *A4++,A2        ;* load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;* load bi & bi+1 from memory

   [A1] SUB     .S1     A1,1,A1         ;* decrement loop counter
|| [A1] B       .S2     LOOP            ; branch to loop
||      LDW     .D1     *A4++,A2        ;** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;** load bi & bi+1 from memory

   [A1] SUB     .S1     A1,1,A1         ;** decrement loop counter
|| [A1] B       .S2     LOOP            ;* branch to loop
||      LDW     .D1     *A4++,A2        ;*** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;*** load bi & bi+1 from memory

   [A1] SUB     .S1     A1,1,A1         ;*** decrement loop counter
|| [A1] B       .S2     LOOP            ;** branch to loop
||      LDW     .D1     *A4++,A2        ;**** load ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;**** load bi & bi+1 from memory

        MPY     .M1X    A2,B2,A6        ; ai * bi
||      MPYH    .M2X    A2,B2,B6        ; ai+1 * bi+1
|| [A1] SUB     .S1     A1,1,A1         ;**** decrement loop counter
|| [A1] B       .S2     LOOP            ;*** branch to loop
||      LDW     .D1     *A4++,A2        ;***** ld ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;***** ld bi & bi+1 from memory

        MPY     .M1X    A2,B2,A6        ;* ai * bi
||      MPYH    .M2X    A2,B2,B6        ;* ai+1 * bi+1
|| [A1] SUB     .S1     A1,1,A1         ;***** decrement loop counter
|| [A1] B       .S2     LOOP            ;**** branch to loop
||      LDW     .D1     *A4++,A2        ;****** ld ai & ai+1 from memory
||      LDW     .D2     *B4++,B2        ;****** ld bi & bi+1 from memory

LOOP:
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1
|| [A1] SUB     .S1     A1,1,A1         ;****** decrement loop counter
|| [A1] B       .S2     LOOP            ;***** branch to loop
||      LDW     .D1     *A4++,A2        ;******* ld ai & ai+1 fm memory
||      LDW     .D2     *B4++,B2        ;******* ld bi & bi+1 fm memory
                                        ; Branch occurs here

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)          [ADDs 1]
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi                  [MPYs 1]
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)          [ADDs 2]
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi                  [MPYs 2]
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)          [ADDs 3]
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi                  [MPYs 3]
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)          [ADDs 4]
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi                  [MPYs 4]
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)          [ADDs 5]
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi                  [MPYs 5]
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)          [ADDs 6]
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)

        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)          [ADDs 7]
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)

        ADD     .L1X    A7,B7,A4        ; sum = sum0 + sum1
Example 5-29. Assembly Code for Floating-Point Dot Product (Software Pipelined
With No Extraneous Loads)

        MVK     .S1     41,A1           ; set up loop counter
||      ZERO    .L1     A8              ; sum0 = 0
||      ZERO    .L2     B8              ; sum1 = 0
||      LDDW    .D1     A4++,A7:A6      ; load ai & ai + 1 from memory
||      LDDW    .D2     B4++,B7:B6      ; load bi & bi + 1 from memory

        LDDW    .D1     A4++,A7:A6      ;* load ai & ai + 1 from memory
||      LDDW    .D2     B4++,B7:B6      ;* load bi & bi + 1 from memory

        LDDW    .D1     A4++,A7:A6      ;** load ai & ai + 1 from memory
||      LDDW    .D2     B4++,B7:B6      ;** load bi & bi + 1 from memory

        LDDW    .D1     A4++,A7:A6      ;*** load ai & ai + 1 from memory
||      LDDW    .D2     B4++,B7:B6      ;*** load bi & bi + 1 from memory
|| [A1] SUB     .S1     A1,1,A1         ; decrement loop counter

        LDDW    .D1     A4++,A7:A6      ;**** load ai & ai + 1 from memory
||      LDDW    .D2     B4++,B7:B6      ;**** load bi & bi + 1 from memory
|| [A1] B       .S2     LOOP            ; branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;* decrement loop counter

        LDDW    .D1     A4++,A7:A6      ;***** load ai & ai + 1 from memory
||      LDDW    .D2     B4++,B7:B6      ;***** load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ; pi = a0 b0
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 b1
|| [A1] B       .S2     LOOP            ;* branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;** decrement loop counter

        LDDW    .D1     A4++,A7:A6      ;****** load ai & ai + 1 from memory
||      LDDW    .D2     B4++,B7:B6      ;****** load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;* pi = a0 b0
||      MPYSP   .M2X    A7,B7,B5        ;* pi1 = a1 b1
|| [A1] B       .S2     LOOP            ;** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;*** decrement loop counter

        LDDW    .D1     A4++,A7:A6      ;******* load ai & ai + 1 from memory
||      LDDW    .D2     B4++,B7:B6      ;******* load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;** pi = a0 b0
||      MPYSP   .M2X    A7,B7,B5        ;** pi1 = a1 b1
|| [A1] B       .S2     LOOP            ;*** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;**** decrement loop counter

        LDDW    .D1     A4++,A7:A6      ;******** load ai & ai + 1 from memory
||      LDDW    .D2     B4++,B7:B6      ;******** load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;*** pi = a0 b0
||      MPYSP   .M2X    A7,B7,B5        ;*** pi1 = a1 b1
|| [A1] B       .S2     LOOP            ;**** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;***** decrement loop counter

LOOP:
        LDDW    .D1     A4++,A7:A6      ;********* load ai & ai + 1 from memory
||      LDDW    .D2     B4++,B7:B6      ;********* load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;**** pi = a0 b0
||      MPYSP   .M2X    A7,B7,B5        ;**** pi1 = a1 b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai bi)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 bi+1)
|| [A1] B       .S2     LOOP            ;***** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;****** decrement loop counter
                                        ; Branch occurs here

        MPYSP   .M1X    A6,B6,A5        ; pi = a0 b0                 [MPYSPs 1]
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai bi)            [ADDSPs 1]
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 bi+1)

        MPYSP   .M1X    A6,B6,A5        ; pi = a0 b0                 [MPYSPs 2]
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai bi)            [ADDSPs 2]
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 bi+1)

        MPYSP   .M1X    A6,B6,A5        ; pi = a0 b0                 [MPYSPs 3]
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai bi)            [ADDSPs 3]
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 bi+1)

        MPYSP   .M1X    A6,B6,A5        ; pi = a0 b0                 [MPYSPs 4]
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai bi)            [ADDSPs 4]
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 bi+1)

        MPYSP   .M1X    A6,B6,A5        ; pi = a0 b0                 [MPYSPs 5]
||      MPYSP   .M2X    A7,B7,B5        ; pi1 = a1 b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai bi)            [ADDSPs 5]
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 bi+1)

        ADDSP   .L1     A5,A8,A8        ; sum0 += (ai bi)            [ADDSPs 6]
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 bi+1)

        ADDSP   .L1     A5,A8,A8        ; sum0 += (ai bi)            [ADDSPs 7]
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 bi+1)

        ADDSP   .L1     A5,A8,A8        ; sum0 += (ai bi)            [ADDSPs 8]
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 bi+1)

        ADDSP   .L1     A5,A8,A8        ; sum0 += (ai bi)            [ADDSPs 9]
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 bi+1)

        ADDSP   .L1X    A8,B8,A0        ; sum(0) = sum0(0) + sum1(0)
        ADDSP   .L2X    A8,B8,B0        ; sum(1) = sum0(1) + sum1(1)
        ADDSP   .L1X    A8,B8,A0        ; sum(2) = sum0(2) + sum1(2)
        ADDSP   .L2X    A8,B8,B0        ; sum(3) = sum0(3) + sum1(3)
        NOP                             ; wait for B0
        ADDSP   .L1X    A0,B0,A5        ; sum(01) = sum(0) + sum(1)
        NOP                             ; wait for next B0
        ADDSP   .L2X    A0,B0,B5        ; sum(23) = sum(2) + sum(3)
        NOP     3                       ;
        ADDSP   .L1X    A5,B5,A4        ; sum = sum(01) + sum(23)
        NOP     3                       ;
5.5.3.4 Priming the Loop


Although Example 5-28 and Example 5-29 execute as fast as possible, the
code size can be smaller without significantly sacrificing performance. To help
reduce code size, you can use a technique called priming the loop. Assuming
that you can handle extraneous loads, start with Example 5-26 or
Example 5-27, which do not have epilogs and, therefore, contain fewer
instructions. (This technique can be used equally well with Example 5-28 or
Example 5-29.)


Fixed-Point Example 


To eliminate the prolog of the fixed-point dot product and, therefore, the extra 
LDW and MPY instructions, begin execution at the loop body (at the LOOP 
label). Eliminating the prolog means that: 


- Two LDWs, two MPYs, and two ADDs occur in the first execution cycle of
  the loop.

- Because the first LDWs require five cycles to write results into a register,
  the MPYs do not multiply valid data until after the loop executes five times.
  The ADDs have no valid data until after seven cycles (five cycles for the
  first LDWs and two more cycles for the first valid MPYs).


Example 5-30 shows the loop without the prolog but with four new instructions
that zero the inputs to the MPY and ADD instructions. Making the MPYs and
ADDs use 0s before valid data is available ensures that the final accumulator
values are unaffected. (The loop counter is initialized to 57 to accommodate
the seven extra cycles needed to prime the loop.)


Because the first LDWs are not issued until after seven cycles, the code in
Example 5-30 requires a total of 65 cycles (7 + 57 + 1). Therefore, you are
reducing the code size with a slight loss in performance.
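
Why zeroed inputs leave the result unchanged can be seen in a small C model of the primed fixed-point loop. This is a simplified sketch, not the code from the example: it processes one element per cycle instead of a packed pair, the function and macro names are hypothetical, and the latencies (5 cycles from load to multiply, 2 cycles from multiply to add) are the ones stated in the text above.

```c
#include <stddef.h>

#define LOAD_TO_MPY 5   /* cycles from a load until the multiply reads it */
#define MPY_TO_ADD  2   /* cycles from a multiply until the add reads it */

/* Model of a primed single-cycle loop: every cycle issues one add, one
 * multiply, and one load, with no prolog and no epilog. All pipeline
 * registers start at zero, so early multiplies and adds contribute 0. */
long primed_dot(const short *a, const short *b, size_t n)
{
    long ld_a[LOAD_TO_MPY] = {0};   /* loads in flight (zeroed mpy inputs) */
    long ld_b[LOAD_TO_MPY] = {0};
    long prod[MPY_TO_ADD]  = {0};   /* products in flight (zeroed add inputs) */
    long sum = 0;                   /* zeroed accumulator */
    size_t cycles = n + LOAD_TO_MPY + MPY_TO_ADD;   /* n + 7 loop cycles */
    size_t c;

    for (c = 0; c < cycles; c++) {
        size_t ls = c % LOAD_TO_MPY;   /* slot written LOAD_TO_MPY cycles ago */
        size_t ps = c % MPY_TO_ADD;    /* slot written MPY_TO_ADD cycles ago */

        sum += prod[ps];                   /* add stage */
        prod[ps] = ld_a[ls] * ld_b[ls];    /* multiply stage */
        /* Load stage. Loads issued after the last element stand in for
         * the extraneous loads; their values never reach the add stage. */
        ld_a[ls] = (c < n) ? a[c] : 0;
        ld_b[ls] = (c < n) ? b[c] : 0;
    }
    return sum;
}
```

The loop runs n + 7 times, which mirrors the loop counter of 57 used for 50 iterations in the fixed-point example.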
Example 5-30. Assembly Code for Fixed-Point Dot Product (Software Pipelined With
Removal of Prolog and Epilog)

        MVK     .S1     57,A1           ; set up loop counter

   [A1] SUB     .S1     A1,1,A1         ; decrement loop counter
||      ZERO    .L1     A7              ; zero out sum0 accumulator
||      ZERO    .L2     B7              ; zero out sum1 accumulator

   [A1] SUB     .S1     A1,1,A1         ;* decrement loop counter
|| [A1] B       .S2     LOOP            ; branch to loop
||      ZERO    .L1     A6              ; zero out add input
||      ZERO    .L2     B6              ; zero out add input

   [A1] SUB     .S1     A1,1,A1         ;** decrement loop counter
|| [A1] B       .S2     LOOP            ;* branch to loop
||      ZERO    .L1     A2              ; zero out mpy input
||      ZERO    .L2     B2              ; zero out mpy input

   [A1] SUB     .S1     A1,1,A1         ;*** decrement loop counter
|| [A1] B       .S2     LOOP            ;** branch to loop

   [A1] SUB     .S1     A1,1,A1         ;**** decrement loop counter
|| [A1] B       .S2     LOOP            ;*** branch to loop

   [A1] SUB     .S1     A1,1,A1         ;***** decrement loop counter
|| [A1] B       .S2     LOOP            ;**** branch to loop

LOOP:
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1
|| [A1] SUB     .S1     A1,1,A1         ;****** decrement loop counter
|| [A1] B       .S2     LOOP            ;***** branch to loop
||      LDW     .D1     *A4++,A2        ;******* ld ai & ai+1 fm memory
||      LDW     .D2     *B4++,B2        ;******* ld bi & bi+1 fm memory
                                        ; Branch occurs here

        ADD     .L1X    A7,B7,A4        ; sum = sum0 + sum1
Floating-Point Example


To eliminate the prolog of the floating-point dot product and, therefore, the 
extra LDDW and MPYSP instructions, begin execution at the loop body (at the 
LOOP label). Eliminating the prolog means that: 


- Two LDDWs, two MPYSPs, and two ADDSPs occur in the first execution
  cycle of the loop.

- Because the first LDDWs require five cycles to write results into a register,
  the MPYSPs do not multiply valid data until after the loop executes five
  times. The ADDSPs have no valid data until after nine cycles (five cycles
  for the first LDDWs and four more cycles for the first valid MPYSPs).


Example 5-31 shows the loop without the prolog but with four new instructions
that zero the inputs to the MPYSP and ADDSP instructions. Making the
MPYSPs and ADDSPs use 0s before valid data is available ensures that the
final accumulator values are unaffected. (The loop counter is initialized to 59
to accommodate the nine extra cycles needed to prime the loop.)


Because the first LDDWs are not issued until after seven cycles, the code in
Example 5-31 requires a total of 81 cycles (7 + 59 + 15). Therefore, you are
reducing the code size with a slight loss in performance.


Example 5-31. Assembly Code for Floating-Point Dot Product (Software Pipelined With
Removal of Prolog and Epilog)

        MVK     .S1     59,A1           ; set up loop counter
||      ZERO    .L1     A7              ; zero out mpysp input
||      ZERO    .L2     B7              ; zero out mpysp input

   [A1] SUB     .S1     A1,1,A1         ; decrement loop counter

   [A1] B       .S2     LOOP            ; branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;* decrement loop counter
||      ZERO    .L1     A8              ; zero out sum0 accumulator
||      ZERO    .L2     B8              ; zero out sum1 accumulator

   [A1] B       .S2     LOOP            ;* branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;** decrement loop counter
||      ZERO    .L1     A5              ; zero out addsp input
||      ZERO    .L2     B5              ; zero out addsp input

   [A1] B       .S2     LOOP            ;** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;*** decrement loop counter
||      ZERO    .L1     A6              ; zero out mpysp input
||      ZERO    .L2     B6              ; zero out mpysp input

   [A1] B       .S2     LOOP            ;*** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;**** decrement loop counter

   [A1] B       .S2     LOOP            ;**** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;***** decrement loop counter

LOOP:
        LDDW    .D1     A4++,A7:A6      ;********* load ai & ai + 1 from memory
||      LDDW    .D2     B4++,B7:B6      ;********* load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;**** pi = a0 b0
||      MPYSP   .M2X    A7,B7,B5        ;**** pi1 = a1 b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai bi)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 bi+1)
|| [A1] B       .S2     LOOP            ;***** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;****** decrement loop counter
                                        ; Branch occurs here

        ADDSP   .L1X    A8,B8,A0        ; sum(0) = sum0(0) + sum1(0)
        ADDSP   .L2X    A8,B8,B0        ; sum(1) = sum0(1) + sum1(1)
        ADDSP   .L1X    A8,B8,A0        ; sum(2) = sum0(2) + sum1(2)
        ADDSP   .L2X    A8,B8,B0        ; sum(3) = sum0(3) + sum1(3)
        NOP                             ; wait for B0
        ADDSP   .L1X    A0,B0,A5        ; sum(01) = sum(0) + sum(1)
        NOP                             ; wait for next B0
        ADDSP   .L2X    A0,B0,B5        ; sum(23) = sum(2) + sum(3)
        NOP     3                       ;
        ADDSP   .L1X    A5,B5,A4        ; sum = sum(01) + sum(23)
        NOP     3                       ;
5.5.3.5 Removing Extra SUB Instructions


To reduce code size further, you can remove extra SUB instructions. If you
know that the loop count is at least 6, you can eliminate the extra SUB
instructions as shown in Example 5-32 and Example 5-33. The first five branch
instructions are made unconditional, because they always execute. (If you do
not know that the loop count is at least 6, you must keep the SUB instructions
that decrement before each conditional branch as in Example 5-30 and
Example 5-31.) Based on the elimination of six SUB instructions, the loop
counter is now 51 (57 - 6) for the fixed-point dot product and 53 (59 - 6) for
the floating-point dot product. This code shows some improvement over
Example 5-30 and Example 5-31. The loop in Example 5-32 requires 63
cycles (5 + 57 + 1) and the loop in Example 5-33 requires 79 cycles
(5 + 59 + 15).


Example 5-32. Assembly Code for Fixed-Point Dot Product (Software Pipelined
With Smallest Code Size)


        B       .S2     LOOP            ; branch to loop
||      MVK     .S1     51,A1           ; set up loop counter

        B       .S2     LOOP            ;* branch to loop

        B       .S2     LOOP            ;** branch to loop
||      ZERO    .L1     A7              ; zero out sum0 accumulator
||      ZERO    .L2     B7              ; zero out sum1 accumulator

        B       .S2     LOOP            ;*** branch to loop
||      ZERO    .L1     A6              ; zero out add input
||      ZERO    .L2     B6              ; zero out add input

        B       .S2     LOOP            ;**** branch to loop
||      ZERO    .L1     A2              ; zero out mpy input
||      ZERO    .L2     B2              ; zero out mpy input

LOOP:
        ADD     .L1     A6,A7,A7        ; sum0 += (ai * bi)
||      ADD     .L2     B6,B7,B7        ; sum1 += (ai+1 * bi+1)
||      MPY     .M1X    A2,B2,A6        ;** ai * bi
||      MPYH    .M2X    A2,B2,B6        ;** ai+1 * bi+1
|| [A1] SUB     .S1     A1,1,A1         ;****** decrement loop counter
|| [A1] B       .S2     LOOP            ;***** branch to loop
||      LDW     .D1     *A4++,A2        ;******* ld ai & ai+1 fm memory
||      LDW     .D2     *B4++,B2        ;******* ld bi & bi+1 fm memory
                                        ; Branch occurs here

        ADD     .L1X    A7,B7,A4        ; sum = sum0 + sum1
Example 5-33. Assembly Code for Floating-Point Dot Product (Software Pipelined
With Smallest Code Size)

        B       .S2     LOOP            ; branch to loop
||      MVK     .S1     53,A1           ; set up loop counter

        B       .S2     LOOP            ;* branch to loop
||      ZERO    .L1     A7              ; zero out mpysp input
||      ZERO    .L2     B7              ; zero out mpysp input

        B       .S2     LOOP            ;** branch to loop
||      ZERO    .L1     A8              ; zero out sum0 accumulator
||      ZERO    .L2     B8              ; zero out sum1 accumulator

        B       .S2     LOOP            ;*** branch to loop
||      ZERO    .L1     A5              ; zero out addsp input
||      ZERO    .L2     B5              ; zero out addsp input

        B       .S2     LOOP            ;**** branch to loop
||      ZERO    .L1     A6              ; zero out mpysp input
||      ZERO    .L2     B6              ; zero out mpysp input

LOOP:
        LDDW    .D1     A4++,A7:A6      ;********* load ai & ai + 1 from memory
||      LDDW    .D2     B4++,B7:B6      ;********* load bi & bi + 1 from memory
||      MPYSP   .M1X    A6,B6,A5        ;**** pi = a0 b0
||      MPYSP   .M2X    A7,B7,B5        ;**** pi1 = a1 b1
||      ADDSP   .L1     A5,A8,A8        ; sum0 += (ai bi)
||      ADDSP   .L2     B5,B8,B8        ; sum1 += (ai+1 bi+1)
|| [A1] B       .S2     LOOP            ;***** branch to loop
|| [A1] SUB     .S1     A1,1,A1         ;****** decrement loop counter
                                        ; Branch occurs here

        ADDSP   .L1X    A8,B8,A0        ; sum(0) = sum0(0) + sum1(0)
        ADDSP   .L2X    A8,B8,B0        ; sum(1) = sum0(1) + sum1(1)
        ADDSP   .L1X    A8,B8,A0        ; sum(2) = sum0(2) + sum1(2)
        ADDSP   .L2X    A8,B8,B0        ; sum(3) = sum0(3) + sum1(3)
        NOP                             ; wait for B0
        ADDSP   .L1X    A0,B0,A5        ; sum(01) = sum(0) + sum(1)
        NOP                             ; wait for next B0
        ADDSP   .L2X    A0,B0,B5        ; sum(23) = sum(2) + sum(3)
        NOP     3                       ;
        ADDSP   .L1X    A5,B5,A4        ; sum = sum(01) + sum(23)
        NOP     3                       ;
Table 5-10 compares the performance of all versions of the fixed-point dot
product code. Table 5-11 compares the performance of all versions of the
floating-point dot product code.


Table 5-10. Comparison of Fixed-Point Dot Product Code Examples

Code Example                                                                          100 Iterations      Cycle Count
Example 5-9   Fixed-point dot product linear assembly                                 2 + 100 x 16        1602
Example 5-10  Fixed-point dot product parallel assembly                               1 + 100 x 8         801
Example 5-19  Fixed-point dot product parallel assembly with LDW                      1 + (50 x 8) + 1    402
Example 5-26  Fixed-point software-pipelined dot product                              7 + 50 + 1          58
Example 5-28  Fixed-point software-pipelined dot product with no extraneous loads     7 + 43 + 7 + 1      58
Example 5-30  Fixed-point software-pipelined dot product with no prolog or epilog     7 + 57 + 1          65
Example 5-32  Fixed-point software-pipelined dot product with smallest code size      5 + 57 + 1          63


Table 5-11. Comparison of Floating-Point Dot Product Code Examples

Code Example                                                                          100 Iterations      Cycle Count
Example 5-11  Floating-point dot product nonparallel assembly                         2 + 100 x 21        2102
Example 5-12  Floating-point dot product parallel assembly                            1 + 100 x 10        1001
Example 5-20  Floating-point dot product parallel assembly with LDDW                  1 + (50 x 10) + 7   508
Example 5-27  Floating-point software-pipelined dot product                           9 + 50 + 15         74
Example 5-29  Floating-point software-pipelined dot product with no extraneous loads  9 + 41 + 9 + 15     74
Example 5-31  Floating-point software-pipelined dot product with no prolog or epilog  7 + 59 + 15         81
Example 5-33  Floating-point software-pipelined dot product with smallest code size   5 + 59 + 15         79
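
The cycle-count expressions in the two tables share two shapes, which the following C sketch makes explicit. The function names are hypothetical; the arithmetic is taken directly from the "100 Iterations" column.

```c
/* Non-pipelined loop: fixed overhead plus a per-pass cost.
 * For example, 2 + 100 x 16 for the first fixed-point entry. */
int looped_cycles(int overhead, int passes, int cycles_per_pass)
{
    return overhead + passes * cycles_per_pass;
}

/* Software-pipelined single-cycle loop: prolog cycles, one cycle per
 * loop pass, then the cycles spent after the loop (epilog and final
 * reduction). For example, 9 + 50 + 15 for Example 5-27. */
int pipelined_cycles(int prolog, int loop_passes, int after_loop)
{
    return prolog + loop_passes + after_loop;
}
```

For instance, `pipelined_cycles(7, 43, 7 + 1)` reproduces the 58 cycles listed for Example 5-28.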
5.6 Modulo Scheduling of Multicycle Loops


Section 5.5 demonstrated the modulo-scheduling technique for the dot 
product code. In that example of a single-cycle loop, none of the instructions 
used the same resources. Multicycle loops can present resource conflicts 
which affect modulo scheduling. This section describes techniques to deal 
with this issue. 


5.6.1 Weighted Vector Sum C Code 


Example 5-34 shows the C code for a weighted vector sum. 


Example 5-34. Weighted Vector Sum C Code 


void w_vec(short a[],short b[],short c[],short m)
{
    int i;

    for (i=0; i<100; i++) {
        c[i] = ((m * a[i]) >> 15) + b[i];
    }
}


5.6.2 Translating C Code to Linear Assembly 


Example 5-35 shows the linear assembly that executes the weighted vector 
sum in Example 5—34. This linear assembly does not have functional units as- 
signed. The dependency graph will help in those decisions. However, before 
looking at the dependency graph, the code can be optimized further. 


Example 5-35. Linear Assembly for Weighted Vector Sum Inner Loop 


        LDH     *aptr++,ai              ; ai
        LDH     *bptr++,bi              ; bi
        MPY     m,ai,pi                 ; m * ai
        SHR     pi,15,pi_scaled         ; (m * ai) >> 15
        ADD     pi_scaled,bi,ci         ; ci = (m * ai) >> 15 + bi
        STH     ci,*cptr++              ; store ci
 [cntr] SUB     cntr,1,cntr             ; decrement loop counter
 [cntr] B       LOOP                    ; branch to loop




5.6.3. Determining the Minimum Iteration Interval 


Example 5-35 includes three memory operations in the inner loop (two LDHs 
and the STH) that must each use a .D unit. Only two .D units are available on 
any single cycle; therefore, this loop requires at least two cycles. Because no 
other resource is used more than twice, the minimum iteration interval for this 
loop is 2. 


Memory operations determine the minimum iteration interval in this example.
Therefore, before scheduling this assembly code, unroll the loop and perform
LDWs to help improve the performance.


5.6.3.1 Unrolling the Weighted Vector Sum C Code 


Example 5-36 shows the C code for an unrolled version of the weighted vector 
sum. 


Example 5-36. Weighted Vector Sum C Code (Unrolled) 


void w_vec(short a[],short b[],short c[],short m)
{
    int i;

    for (i=0; i<100; i+=2) {
        c[i]   = ((m * a[i]) >> 15) + b[i];
        c[i+1] = ((m * a[i+1]) >> 15) + b[i+1];
    }
}
5.6.3.2 Translating Unrolled Inner Loop to Linear Assembly


Example 5-37 shows the linear assembly that calculates c[i] and c[i+1] for the 
weighted vector sum in Example 5-36. 


- The two store pointers (*ciptr and *ci+1ptr) are separated so that one
  (*ciptr) increments by 2 through the odd elements of the array and the
  other (*ci+1ptr) increments through the even elements.

- AND and SHR separate bi and bi+1 into two separate registers.

- This code assumes that mask is preloaded with 0x0000FFFF to clear the
  upper 16 bits. The shift right of 16 places bi+1 into the 16 LSBs.
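
The AND and SHR steps described above can be illustrated in C. This is a sketch under stated assumptions: the function names are hypothetical, the word is treated as an unsigned bit pattern, and bi is assumed to occupy the lower halfword with bi+1 in the upper halfword, as the mask and shift in the text imply.

```c
#include <stdint.h>

/* Keep the 16 LSBs: the counterpart of AND with mask 0x0000FFFF. */
uint16_t low_halfword(uint32_t word)
{
    return (uint16_t)(word & 0x0000FFFFu);
}

/* Move the upper halfword into the 16 LSBs: the counterpart of a
 * shift right by 16. */
uint16_t high_halfword(uint32_t word)
{
    return (uint16_t)(word >> 16);
}
```

For a loaded word of 0x12345678, `low_halfword` yields 0x5678 and `high_halfword` yields 0x1234.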


Example 5-37. Linear Assembly for Weighted Vector Sum Using LDW

        LDW     *aptr++,ai_i+1          ; ai & ai+1
        LDW     *bptr++,bi_i+1          ; bi & bi+1
        MPY     m,ai_i+1,pi             ; m * ai
        MPYHL   m,ai_i+1,pi+1           ; m * ai+1
        SHR     pi,15,pi_scaled         ; (m * ai) >> 15
        SHR     pi+1,15,pi+1_scaled     ; (m * ai+1) >> 15
        AND     bi_i+1,mask,bi          ; bi
        SHR     bi_i+1,16,bi+1          ; bi+1
        ADD     pi_scaled,bi,ci         ; ci = (m * ai) >> 15 + bi
        ADD     pi+1_scaled,bi+1,ci+1   ; ci+1 = (m * ai+1) >> 15 + bi+1
        STH     ci,*ciptr++[2]          ; store ci
        STH     ci+1,*ci+1ptr++[2]      ; store ci+1
 [cntr] SUB     cntr,1,cntr             ; decrement loop counter
 [cntr] B       LOOP                    ; branch to loop

5.6.3.3 Determining a New Minimum Iteration Interval 


Use the following considerations to determine the minimum iteration interval 
for the assembly instructions in Example 5-37: 


- Four memory operations (two LDWs and two STHs) must each use a .D
  unit. With two .D units available, this loop still requires only two cycles.

- Four instructions must use the .S units (three SHRs and one branch). With
  two .S units available, the minimum iteration interval is still 2.

- The two MPYs do not increase the minimum iteration interval.

- Because the remaining four instructions (two ADDs, AND, and SUB) can
  all use a .L unit, the minimum iteration interval for this loop is the same as
  in Example 5-35.


By using LDWs instead of LDHs, the program can do twice as much work in 
the same number of cycles. 
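
The resource reasoning in this section amounts to taking, for each kind of functional unit, the number of instructions that need it divided by the number of such units, rounded up, and then taking the largest result. A minimal C sketch follows; the function name is hypothetical, and the bound covers resources only (it ignores dependency constraints).

```c
/* Resource-based lower bound on the iteration interval.
 * uses[i]      - instructions per iteration that need unit type i
 * available[i] - number of units of type i (2 for .D, .S, .M, .L)
 * n_types      - number of unit types considered */
int min_iteration_interval(const int uses[], const int available[], int n_types)
{
    int mii = 1;
    int i;

    for (i = 0; i < n_types; i++) {
        /* ceil(uses / available) using integer arithmetic */
        int need = (uses[i] + available[i] - 1) / available[i];
        if (need > mii)
            mii = need;
    }
    return mii;
}
```

With three .D operations on two .D units (Example 5-35) the bound is 2; with four .D, four .S, two .M, and four .L operations on two units each (Example 5-37) it is still 2.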




5.6.4 Drawing a Dependency Graph 


To achieve a minimum iteration interval of 2, you must put an equal number
of operations per unit on each side of the dependency graph. Three operations
in one unit on a side would result in a minimum iteration interval of 3.


Figure 5-11 shows the dependency graph divided evenly with a minimum
iteration interval of 2.

Figure 5-11. Dependency Graph of Weighted Vector Sum

[Figure: dependency graph divided into an A side and a B side; legible nodes include LDW (.D1), MPY, and MPYHL.]
5.6.5 Linear Assembly Resource Allocation


Using the dependency graph, you can allocate functional units and registers
as shown in Example 5-38. This code is based on the following assumptions:

- The pointers are initialized outside the loop.
- m resides in B6, which causes both .M units to use a cross path.
- The mask in the AND instruction resides in B10.


Example 5-38. Linear Assembly for Weighted Vector Sum With Resources Allocated

        LDW     .D2     *A4++,A2        ; ai & ai+1
        LDW     .D1     *B4++,B2        ; bi & bi+1
        MPY             A2,B6,A5        ; pi = m * ai
        MPYHL           A2,B6,B5        ; pi+1 = m * ai+1
        SHR             A5,15,A7        ; pi_scaled = (m * ai) >> 15
        SHR             B5,15,B7        ; pi+1_scaled = (m * ai+1) >> 15
        AND             B2,B10,B8       ; bi
        SHR             B2,16,B1        ; bi+1
        ADD             A7,B8,A9        ; ci = (m * ai) >> 15 + bi
        ADD             B7,B1,B9        ; ci+1 = (m * ai+1) >> 15 + bi+1
        STH             A9,*A6++[2]     ; store ci
        STH             B9,*B0++[2]     ; store ci+1
   [A1] SUB             A1,1,A1         ; decrement loop counter
   [A1] B               LOOP            ; branch to loop


5.6.6 Modulo Iteration Interval Scheduling 
Table 5-12 provides a method to keep track of resources that are a modulo
iteration interval away from each other. In the single-cycle dot product
example, every instruction executed every cycle and, therefore, required only
one set of resources. Table 5-12 includes two groups of resources, which are
necessary because you are scheduling a two-cycle loop.

- Instructions that execute on cycle k also execute on cycle k + 2, k + 4, etc.
  Instructions scheduled on these even cycles cannot use the same
  resources.

- Instructions that execute on cycle k + 1 also execute on cycle k + 3, k + 5,
  etc. Instructions scheduled on these odd cycles cannot use the same
  resources.

- Because two instructions (MPY and ADD) use the 1X path but do not use
  the same functional unit, Table 5-12 includes two rows (1X and 2X) that
  help you keep track of the cross path resources.

Only seven instructions have been scheduled in this table.

- The two LDWs use the .D units on the even cycles.

- The MPY and MPYHL are scheduled on cycle 5 because the LDW has four
  delay slots. The MPY instructions appear in two rows because they use
  the .M and cross path resources on cycles 5, 7, 9, etc.

- The two SHR instructions are scheduled two cycles after the MPY to allow
  for the MPY's single delay slot.

- The AND is scheduled on cycle 5, four delay slots after the LDW.
Table 5-12. Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop)

Unit/Cycle  0                 2                 4                 6                 8                 10
.D1         LDW ai_i+1        * LDW ai_i+1      ** LDW ai_i+1     *** LDW ai_i+1    **** LDW ai_i+1   ***** LDW ai_i+1
.D2         LDW bi_i+1        * LDW bi_i+1      ** LDW bi_i+1     *** LDW bi_i+1    **** LDW bi_i+1   ***** LDW bi_i+1
.M1
.M2
.L1
.L2
.S1
.S2
1X
2X

Unit/Cycle  1                 3                 5                 7                 9                 11
.D1
.D2
.M1                                             MPY pi            * MPY pi          ** MPY pi         *** MPY pi
.M2                                             MPYHL pi+1        * MPYHL pi+1      ** MPYHL pi+1     *** MPYHL pi+1
.L1                                             AND bi            * AND bi          ** AND bi         *** AND bi
.L2
.S1                                                               SHR pi_s          * SHR pi_s        ** SHR pi_s
.S2                                                               SHR pi+1_s        * SHR pi+1_s      ** SHR pi+1_s
1X                                              MPY pi            * MPY pi          ** MPY pi         *** MPY pi
2X                                              MPYHL pi+1        * MPYHL pi+1      ** MPYHL pi+1     *** MPYHL pi+1

Note: The asterisks indicate the iteration of the loop; shaded cells indicate cycle 0.
5.6.6.1 Resource Conflicts


Resources from one instruction cannot conflict with resources from any other 
instruction scheduled modulo iteration intervals away. In other words, for a 
2-cycle loop, instructions scheduled on cycle n cannot use the same resources 
as instructions scheduled on cycles n + 2, n + 4, n + 6, etc. Table 5-13 shows
the addition of the SHR bi+1 instruction. This must avoid a conflict of resources 
in cycles 5 and 7, which are one iteration interval away from each other. 


Even though LDW bi_i+1 (.D2, cycle 0) finishes on cycle 5, its child, SHR bi+1, 
cannot be scheduled on .S2 until cycle 6 because of a resource conflict with 
SHR pi+1_scaled, which is on .S2 in cycle 7. 
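
The modulo rule above can be sketched in C. This is only an illustration, not part of the tool flow: the helper names (modulo_slot, same_unit_conflict, earliest_free_cycle) are hypothetical, and the sketch assumes non-negative cycle numbers and a single already-occupied cycle on the unit in question.

```c
#include <stdbool.h>

/* Slot of a cycle within the modulo schedule (cycle mod iteration interval).
   Assumes cycle >= 0 and ii >= 1. */
static int modulo_slot(int cycle, int ii)
{
    return cycle % ii;
}

/* Two uses of the SAME unit conflict when they land in the same slot. */
bool same_unit_conflict(int cycle_a, int cycle_b, int ii)
{
    return modulo_slot(cycle_a, ii) == modulo_slot(cycle_b, ii);
}

/* Earliest cycle >= ready_cycle that avoids the occupied cycle on that unit,
   or -1 when every slot collides (only possible when ii == 1). */
int earliest_free_cycle(int ready_cycle, int occupied_cycle, int ii)
{
    for (int c = ready_cycle; c < ready_cycle + ii; c++) {
        if (!same_unit_conflict(c, occupied_cycle, ii))
            return c;
    }
    return -1;
}
```

With an iteration interval of 2, a value that is ready on cycle 5 and an existing .S2 use on cycle 7 give earliest_free_cycle(5, 7, 2) == 6, which matches the placement of SHR bi+1 described above.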


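Figure 5-12. Dependency Graph of Weighted Vector Sum (Showing Resource Conflict)

[Figure: dependency graph split into A side and B side, with LDW, MPY, MPYHL, and SHR nodes; one SHR is annotated "Scheduled on cycle 5" and the SHR producing pi_scaled is annotated "Scheduled on cycle 7".]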
Table 5-13. Modulo Iteration Interval Table for Weighted Vector Sum With SHR Instructions

Unit/Cycle | 0 | 2 | 4 | 6 | 8 | 10, 12, 14, ...
.D1 | LDW ai_i+1 | * LDW ai_i+1 | ** LDW ai_i+1 | *** LDW ai_i+1 | **** LDW ai_i+1 | ***** LDW ai_i+1
.D2 | LDW bi_i+1 | * LDW bi_i+1 | ** LDW bi_i+1 | *** LDW bi_i+1 | **** LDW bi_i+1 | ***** LDW bi_i+1
.M1 | | | | | |
.M2 | | | | | |
.L1 | | | | | |
.L2 | | | | | |
.S1 | | | | | |
.S2 | | | | SHR bi+1 | * SHR bi+1 | ** SHR bi+1
1X | | | | | |
2X | | | | | |

Unit/Cycle | 1 | 3 | 5 | 7 | 9 | 11, 13, 15, ...
.D1 | | | | | |
.D2 | | | | | |
.M1 | | | MPY pi | * MPY pi | ** MPY pi | *** MPY pi
.M2 | | | MPYHL pi+1 | * MPYHL pi+1 | ** MPYHL pi+1 | *** MPYHL pi+1
.L1 | | | AND bi | * AND bi | ** AND bi | *** AND bi
.L2 | | | | | |
.S1 | | | | SHR pi_s | * SHR pi_s | ** SHR pi_s
.S2 | | | | SHR pi+1_s | * SHR pi+1_s | ** SHR pi+1_s
1X | | | MPY pi | * MPY pi | ** MPY pi | *** MPY pi
2X | | | MPYHL pi+1 | * MPYHL pi+1 | ** MPYHL pi+1 | *** MPYHL pi+1

Note: The asterisks indicate the iteration of the loop; shading indicates changes in scheduling from Table 5-12.
5.6.6.2 Live Too Long


Scheduling SHR bi+1 on cycle 6 now creates a problem with scheduling the 
ADD ci instruction. The parents of ADD ci (AND bi and SHR pi_scaled) are 
scheduled on cycles 5 and 7, respectively. Because the SHR pi_scaled is 
scheduled on cycle 7, the earliest you can schedule ADD ci is cycle 8. 


However, in cycle 7, AND bi * writes bi for the next iteration of the loop, which
creates a scheduling problem with the ADD ci instruction. If you schedule
ADD ci on cycle 8, the ADD instruction reads the parent value of bi for the next
iteration, which is incorrect. The ADD ci demonstrates a live-too-long problem.


No value can be live in a register for more than the number of cycles in the loop.
Otherwise, iteration n + 1 writes into the register before iteration n has read that
register. Therefore, in a 2-cycle loop, if a value is written to a register at the end
of cycle n, then all children of that value must read the register before the end
of cycle n + 2.
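
As a rough sketch, the live-too-long rule can be written as a one-line check in C. The function name and parameters are hypothetical; write_cycle is assumed to be the cycle at whose end the parent's result lands in the register (the issue cycle plus any delay slots), and last_read_cycle is the cycle on which the last child reads it.

```c
#include <stdbool.h>

/* A value written at the end of write_cycle must be read no later than
   write_cycle + ii; otherwise the next iteration overwrites it first. */
bool lives_too_long(int write_cycle, int last_read_cycle, int ii)
{
    return (last_read_cycle - write_cycle) > ii;
}
```

For the 2-cycle loop, AND bi on cycle 5 with ADD ci on cycle 8 gives a lifetime of 3 cycles and fails the check; AND bi on cycle 6 with ADD ci on cycle 8 passes.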


5.6.6.3 Solving the Live-Too-Long Problem 


The live-too-long problem in Table 5-13 means that the bi value would have 
to be live from cycles 6-8, or 3 cycles. No loop variable can live longer than 
the iteration interval, because a child would then read the parent value for the 
next iteration. 


To solve this problem, move AND bi to cycle 6 so that you can schedule ADD ci
to read the correct value on cycle 8, as shown in Figure 5-13 and Table 5-14.
Figure 5-13. Dependency Graph of Weighted Vector Sum (With Resource Conflict
Resolved)

[Figure: dependency graph split into A side and B side; the recoverable nodes include SHR (pi_scaled), ADD, SUB, and B, with edge latencies and shaded cycle numbers.]

Note: Shaded numbers indicate the cycle in which the instruction is first scheduled.
Table 5-14. Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop)

Unit/Cycle | 0 | 2 | 4 | 6 | 8 | 10
.D1 | LDW ai_i+1 | * LDW ai_i+1 | ** LDW ai_i+1 | *** LDW ai_i+1 | **** LDW ai_i+1 | ***** LDW ai_i+1
.D2 | LDW bi_i+1 | * LDW bi_i+1 | ** LDW bi_i+1 | *** LDW bi_i+1 | **** LDW bi_i+1 | ***** LDW bi_i+1
.M1 | | | | | |
.M2 | | | | | |
.L1 | | | | | ADD ci | * ADD ci
.L2 | | | | AND bi | * AND bi | ** AND bi
.S1 | | | | | |
.S2 | | | | SHR bi+1 | * SHR bi+1 | ** SHR bi+1
1X | | | | | |
2X | | | | | |

Unit/Cycle | 1 | 3 | 5 | 7 | 9 | 11
.D1 | | | | | |
.D2 | | | | | |
.M1 | | | MPY pi | * MPY pi | ** MPY pi | *** MPY pi
.M2 | | | MPYHL pi+1 | * MPYHL pi+1 | ** MPYHL pi+1 | *** MPYHL pi+1
.L1 | | | | | |
.L2 | | | | | |
.S1 | | | | SHR pi_s | * SHR pi_s | ** SHR pi_s
.S2 | | | | SHR pi+1_s | * SHR pi+1_s | ** SHR pi+1_s
1X | | | MPY pi | * MPY pi | ** MPY pi | *** MPY pi
2X | | | MPYHL pi+1 | * MPYHL pi+1 | ** MPYHL pi+1 | *** MPYHL pi+1

Note: The asterisks indicate the iteration of the loop; shading indicates changes in scheduling from Table 5-13.
5.6.6.4 Scheduling the Remaining Instructions


Figure 5-14 shows the dependency graph with additional scheduling 
changes. The final version of the loop, with all instructions scheduled correctly, 
is shown in Table 5—15. 


Figure 5-14. Dependency Graph of Weighted Vector Sum (Scheduling ci +1) 


Note: Shaded numbers indicate the cycle in which the instruction is first scheduled. 
Table 5-15 shows the following additions:

❏ B LOOP (.S1, cycle 6)
❏ SUB cntr (.L1, cycle 5)
❏ ADD ci+1 (.L2, cycle 10)
❏ STH ci (cycle 9)
❏ STH ci+1 (cycle 11)

To avoid resource conflicts and live-too-long problems, Table 5-15 also
includes the following additional changes:

❏ LDW bi_i+1 (.D2) moved from cycle 0 to cycle 2.
❏ AND bi (.L2) moved from cycle 6 to cycle 7.
❏ SHR pi+1_scaled (.S2) moved from cycle 7 to cycle 9.
❏ MPYHL pi+1 moved from cycle 5 to cycle 6.
❏ SHR bi+1 moved from cycle 6 to cycle 8.


From the table, you can see that this loop is pipelined six iterations deep,
because iterations n and n + 5 execute in parallel.
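
One way to see where "six iterations deep" comes from is to divide the length of a single iteration's schedule by the iteration interval and round up. The following C sketch assumes that reading; pipeline_depth is a hypothetical helper, not a tool function.

```c
/* Number of iterations in flight at once: the schedule length of one
   iteration (first to last scheduled cycle, inclusive) divided by the
   iteration interval, rounded up. */
int pipeline_depth(int first_cycle, int last_cycle, int ii)
{
    int length = last_cycle - first_cycle + 1;
    return (length + ii - 1) / ii;
}
```

For the weighted vector sum, the first instruction is on cycle 0 and the last (STH ci+1) is on cycle 11, so pipeline_depth(0, 11, 2) is 6.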
Table 5-15. Modulo Iteration Interval Table for Weighted Vector Sum (2-Cycle Loop)

Unit/Cycle | 0 | 2 | 4 | 6 | 8 | 10, 12, 14, ...
.D1 | LDW ai_i+1 | * LDW ai_i+1 | ** LDW ai_i+1 | *** LDW ai_i+1 | **** LDW ai_i+1 | ***** LDW ai_i+1
.D2 | | LDW bi_i+1 | * LDW bi_i+1 | ** LDW bi_i+1 | *** LDW bi_i+1 | **** LDW bi_i+1
.M1 | | | | | |
.M2 | | | | MPYHL pi+1 | * MPYHL pi+1 | ** MPYHL pi+1
.L1 | | | | | ADD ci | * ADD ci
.L2 | | | | | | ADD ci+1
.S1 | | | | B LOOP | * B LOOP | ** B LOOP
.S2 | | | | | SHR bi+1 | * SHR bi+1
1X | | | | | ADD ci | * ADD ci
2X | | | | MPYHL pi+1 | * MPYHL pi+1 | ** MPYHL pi+1

Unit/Cycle | 1 | 3 | 5 | 7 | 9 | 11, 13, 15, ...
.D1 | | | | | STH ci | * STH ci
.D2 | | | | | | STH ci+1
.M1 | | | MPY pi | * MPY pi | ** MPY pi | *** MPY pi
.M2 | | | | | |
.L1 | | | SUB cntr | * SUB cntr | ** SUB cntr | *** SUB cntr
.L2 | | | | AND bi | * AND bi | ** AND bi
.S1 | | | | SHR pi_s | * SHR pi_s | ** SHR pi_s
.S2 | | | | | SHR pi+1_s | * SHR pi+1_s
1X | | | MPY pi | * MPY pi | ** MPY pi | *** MPY pi
2X | | | | | |

Note: The asterisks indicate the iteration of the loop; shading indicates changes in scheduling from Table 5-14.
5.6.7 Using the Assembly Optimizer for the Weighted Vector Sum


Example 5-39 shows the linear assembly code to perform the weighted vector
sum. You can use this code as input to the assembly optimizer to create a
software-pipelined loop instead of scheduling this by hand.


Example 5-39. Linear Assembly for Weighted Vector Sum 


        .global _w_vec
_w_vec: .cproc  a, b, c, m
        .reg    ai_i1, bi_i1, pi, pi1, pi_i1, pi_s, pi1_s
        .reg    mask, bi, bi1, ci, ci1, c1, cntr
        MVK     -1,mask                 ; set to all 1s to create 0xFFFFFFFF
        MVKH    0,mask                  ; clear upper 16 bits to create 0xFFFF
        MVK     50,cntr                 ; cntr = 100/2
        ADD     2,c,c1                  ; point to c[1]
LOOP:   .trip 50
        LDW     .D2     *a++,ai_i1      ; ai & ai+1
        LDW     .D1     *b++,bi_i1      ; bi & bi+1
        MPY     .M1     ai_i1,m,pi      ; m * ai
        MPYHL   .M2     ai_i1,m,pi1     ; m * ai+1
        SHR     .S1     pi,15,pi_s      ; (m * ai) >> 15
        SHR     .S2     pi1,15,pi1_s    ; (m * ai+1) >> 15
        AND     .L2X    bi_i1,mask,bi   ; bi
        SHR     .S2     bi_i1,16,bi1    ; bi+1
        ADD     .L1X    pi_s,bi,ci      ; ci = (m * ai) >> 15 + bi
        ADD     .L2X    pi1_s,bi1,ci1   ; ci+1 = (m * ai+1) >> 15 + bi+1
        STH     .D2     ci,*c++[2]      ; store ci
        STH     .D1     ci1,*c1++[2]    ; store ci+1
 [cntr] SUB     cntr,1,cntr             ; decrement loop counter
 [cntr] B       LOOP                    ; branch to loop
        .endproc
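
To make the data flow of one loop pass easier to follow, the sketch below mirrors it in C for a single pair of elements. It is an illustration under stated assumptions, not the manual's code: the function name w_vec_pair is hypothetical, element i is assumed to sit in the low half of each loaded word (little-endian packing), and the casts and right shifts of negative values rely on the usual two's-complement, arithmetic-shift behavior of common compilers.

```c
#include <stdint.h>

/* One pass of the loop body: two 16-bit elements per 32-bit word. */
void w_vec_pair(uint32_t ai_i1, uint32_t bi_i1, int16_t m,
                int16_t *ci, int16_t *ci1)
{
    int32_t pi  = (int32_t)(int16_t)(ai_i1 & 0xFFFF) * m; /* MPY: low half of ai_i1    */
    int32_t pi1 = (int32_t)(int16_t)(ai_i1 >> 16) * m;    /* MPYHL: high half of ai_i1 */
    int32_t bi  = (int32_t)(bi_i1 & 0xFFFF);              /* AND with the 0xFFFF mask  */
    int32_t bi1 = (int32_t)bi_i1 >> 16;                   /* SHR by 16 (sign-extends)  */

    /* STH stores only the low 16 bits, so the zero-extended bi is harmless. */
    *ci  = (int16_t)((pi  >> 15) + bi);
    *ci1 = (int16_t)((pi1 >> 15) + bi1);
}
```

For example, with a = {3, -4}, b = {-20, -20}, and m = 16384, the packed inputs are 0xFFFC0003 and 0xFFECFFEC, and the two stored results are -19 and -22.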
5.6.8 Final Assembly


Example 5-40 shows the final assembly code for the weighted vector sum. 
The following optimizations are included: 


❏ While iteration n of instruction STH ci+1 is executing, iteration n + 1 of
  STH ci is executing. To prevent the STH ci instruction from executing
  iteration 51 while STH ci+1 executes iteration 50, execute the loop only 49
  times and schedule the final executions of ADD ci+1 and STH ci+1 after
  exiting the loop.

❏ The mask for the AND instruction is created with MVK and MVKH in
  parallel with the loop prolog.

❏ The pointer to the odd elements in array c is also set up in parallel with the
  loop prolog.




Example 5-40. Assembly Code for Weighted Vector Sum

           LDW     .D1     *A4++,A2        ; ai & ai+1

           ADD     .L2X    A6,2,B0         ; set pointer to ci+1

           LDW     .D2     *B4++,B2        ; bi & bi+1
||         LDW     .D1     *A4++,A2        ;* ai & ai+1

           MVK     .S2     -1,B10          ; set to all 1s (0xFFFFFFFF)

           LDW     .D2     *B4++,B2        ;* bi & bi+1
||         LDW     .D1     *A4++,A2        ;** ai & ai+1
||         MVK     .S1     49,A1           ; set up loop counter
||         MVKH    .S2     0,B10           ; clr upper 16 bits (0x0000FFFF)

           MPY     .M1X    A2,B6,A5        ; m * ai
|| [A1]    SUB     .L1     A1,1,A1         ; decrement loop counter

           MPYHL   .M2X    A2,B6,B5        ; m * ai+1
|| [A1]    B       .S1     LOOP            ; branch to loop
||         LDW     .D2     *B4++,B2        ;** bi & bi+1
||         LDW     .D1     *A4++,A2        ;*** ai & ai+1

           SHR     .S1     A5,15,A7        ; (m * ai) >> 15
||         AND     .L2     B2,B10,B8       ; bi
||         MPY     .M1X    A2,B6,A5        ;* m * ai
|| [A1]    SUB     .L1     A1,1,A1         ;* decrement loop counter

           SHR     .S2     B2,16,B1        ; bi+1
||         ADD     .L1X    A7,B8,A9        ; ci = (m * ai) >> 15 + bi
||         MPYHL   .M2X    A2,B6,B5        ;* m * ai+1
|| [A1]    B       .S1     LOOP            ;* branch to loop
||         LDW     .D2     *B4++,B2        ;*** bi & bi+1
||         LDW     .D1     *A4++,A2        ;**** ai & ai+1

           SHR     .S2     B5,15,B7        ; (m * ai+1) >> 15
||         STH     .D1     A9,*A6++[2]     ; store ci
||         SHR     .S1     A5,15,A7        ;* (m * ai) >> 15
||         AND     .L2     B2,B10,B8       ;* bi
|| [A1]    SUB     .L1     A1,1,A1         ;** decrement loop counter
||         MPY     .M1X    A2,B6,A5        ;** m * ai

LOOP:
           ADD     .L2     B7,B1,B9        ; ci+1 = (m * ai+1) >> 15 + bi+1
||         SHR     .S2     B2,16,B1        ;* bi+1
||         ADD     .L1X    A7,B8,A9        ;* ci = (m * ai) >> 15 + bi
||         MPYHL   .M2X    A2,B6,B5        ;** m * ai+1
|| [A1]    B       .S1     LOOP            ;** branch to loop
||         LDW     .D2     *B4++,B2        ;**** bi & bi+1
||         LDW     .D1     *A4++,A2        ;***** ai & ai+1
Example 5-40. Assembly Code for Weighted Vector Sum (Continued)

           STH     .D2     B9,*B0++[2]     ; store ci+1
||         SHR     .S2     B5,15,B7        ;* (m * ai+1) >> 15
||         STH     .D1     A9,*A6++[2]     ;* store ci
||         SHR     .S1     A5,15,A7        ;** (m * ai) >> 15
||         AND     .L2     B2,B10,B8       ;** bi
|| [A1]    SUB     .L1     A1,1,A1         ;*** decrement loop counter
||         MPY     .M1X    A2,B6,A5        ;*** m * ai
           ; Branch occurs here

           ADD     .L2     B7,B1,B9        ; ci+1 = (m * ai+1) >> 15 + bi+1

           STH     .D2     B9,*B0          ; store ci+1
5.7 Loop Carry Paths


Loop carry paths occur when one iteration of a loop writes a value that must
be read by a future iteration. A loop carry path can affect the performance of
a software-pipelined loop that executes multiple iterations in parallel.
Sometimes loop carry paths (instead of resources) determine the minimum iteration
interval.


IIR filter code contains a loop carry path; output samples are used as input to
the computation of the next output sample.

5.7.1 IIR Filter C Code


Example 5—41 shows C code for a simple IIR filter. In this example, y[i] is an 
input to the calculation of y[i+1]. Before y[i] can be read for the next iteration, 
y[i+1] must be computed from the previous iteration. 


Example 5—41. IIR Filter C Code 


void iir(short x[], short y[], short c1, short c2, short c3)
{
    int i;

    for (i = 0; i < 100; i++) {
        y[i+1] = (c1*x[i] + c2*x[i+1] + c3*y[i]) >> 15;
    }
}
5.7.2 Translating C Code to Linear Assembly (Inner Loop)


Example 5—42 shows the C6000 instructions that execute the inner loop of the 
IIR filter C code. In this example: 


❏ xptr is not postincremented after loading xi+1, because xi of the next
  iteration is actually xi+1 of the current iteration. Thus, the pointer points to
  the same address when loading both xi+1 for one iteration and xi for the
  next iteration.

❏ yptr is also not postincremented after storing yi+1, because yi of the next
  iteration is yi+1 for the current iteration.


Example 5-42. Linear Assembly for IIR Inner Loop 


        LDH     *xptr++,xi      ; xi
        MPY     c1,xi,p0        ; c1 * xi
        LDH     *xptr,xi+1      ; xi+1
        MPY     c2,xi+1,p1      ; c2 * xi+1
        ADD     p0,p1,s0        ; c1 * xi + c2 * xi+1
        LDH     *yptr++,yi      ; yi
        MPY     c3,yi,p2        ; c3 * yi
        ADD     s0,p2,s1        ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR     s1,15,yi+1      ; yi+1
        STH     yi+1,*yptr      ; store yi+1
 [cntr] SUB     cntr,1,cntr     ; decrement loop counter
 [cntr] B       LOOP            ; branch to loop
5.7.3 Drawing a Dependency Graph


Figure 5-15 shows the dependency graph for the IIR filter. A loop carry path
exists from the store of yi+1 to the load of yi. The path between the STH and
the LDH is one cycle because the load and store instructions use the same
memory pipeline. Therefore, if a store is issued to a particular address on cycle
n and a load from that same address is issued on the next cycle, the load reads
the value that was written by the store instruction.


Figure 5-15. Dependency Graph of IIR Filter

[Figure: dependency graph split into A side and B side, with three LDH nodes at the top.]

Note: The shaded numbers show the loop carry path: 5 + 2 + 1 + 1 + 1 = 10.
5.7.4 Determining the Minimum Iteration Interval


To determine the minimum iteration interval, you must consider both resources 
and data dependency constraints. Based on resources in Table 5—16, the 
minimum iteration interval is 2. 


Note: 


There are six non-.M units available: three on the A side (.S1, .D1, .L1) and 
three on the B side (.S2, .D2, .L2). Therefore, to determine resource 
constraints, divide the total number of non-.M units used on each side by 3 
(3 is the total number of non-.M units available on each side). 


Based on non-.M unit resources in Table 5-16, the minimum iteration interval
for the IIR filter is 2 because the total non-.M units on the A side is 5 (5 ÷ 3
is greater than 1, so you round up to the next whole number). The B side uses
only three non-.M units, so this does not affect the minimum iteration interval,
and no other unit is used more than twice.
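
The round-up in the note above is a ceiling division. A minimal C sketch, with a hypothetical helper name, might look like this:

```c
/* Smallest iteration interval that fits `used` instructions onto
   `available` interchangeable units: ceil(used / available). */
int resource_bound(int used, int available)
{
    return (used + available - 1) / available;
}
```

For the IIR filter, resource_bound(5, 3) is 2 for the A side and resource_bound(3, 3) is 1 for the B side; the two MPYs on the single .M1 unit give resource_bound(2, 1), which is also 2.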




Table 5-16. Resource Table for IIR Filter

(a) A side

Unit(s) | Instructions | Total/Unit
.M1 | 2 MPYs | 2
.S1 | B | 1
.D1 | 2 LDHs | 2
.L1, .S1, or .D1 | ADD & SUB | 2
Total non-.M units | | 5

(b) B side

Unit(s) | Instructions | Total/Unit
.M2 | MPY | 1
.S2 | SHR | 1
.D2 | STH | 1
.L2 or .S2, .D2 | ADD | 1
Total non-.M units | | 3


However, the IIR has a data dependency constraint defined by its loop carry 
path. Figure 5-15 shows that if you schedule LDH yi on cycle 0: 


❏ The earliest you can schedule MPY p2 is on cycle 5.
❏ The earliest you can schedule ADD s1 is on cycle 7.
❏ SHR yi+1 must be on cycle 8 and STH on cycle 9.


Because the LDH must wait for the STH to be issued, the earliest the
second iteration can begin is cycle 10.


To determine the minimum loop carry path, add all of the numbers along the
loop paths in the dependency graph. This means that this loop carry path is
10 (5 + 2 + 1 + 1 + 1).




Although the minimum iteration interval is the greater of the resource limits and
data dependency constraints, an interval of 10 seems slow. Figure 5-16
shows how to improve the performance.
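
The two constraints combine as a simple maximum. The C sketch below is illustrative only; loop_carry_bound and min_iteration_interval are hypothetical names, and the edge values are the cycle counts read off the loop carry path in the dependency graph.

```c
#include <stddef.h>

/* Sum of the cycle counts along one loop carry path. */
int loop_carry_bound(const int *edge_cycles, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += edge_cycles[i];
    return total;
}

/* The minimum iteration interval is the greater of the two limits. */
int min_iteration_interval(int resource_limit, int carry_limit)
{
    return resource_limit > carry_limit ? resource_limit : carry_limit;
}
```

With the path 5 + 2 + 1 + 1 + 1 from Figure 5-15 and a resource limit of 2, the result is 10.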


5.7.4.1 Drawing a New Dependency Graph 


Figure 5-16 shows a new graph with a loop carry path of 4 (2 + 1 + 1). Because
the MPY p2 instruction can read yi+1 while it is still in a register, you can reduce
the loop carry path by six cycles. LDH yi is no longer in the graph. Instead, you
can issue LDH y[0] once outside the loop. In every iteration after that, the yi+1
values written by the SHR instruction are valid y inputs to the MPY instruction.
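
At the C level, the same idea corresponds to carrying y in a local variable instead of reloading it from memory. The sketch below is an assumption-laden illustration: both function names and the extra length parameter n are hypothetical, and iir_ref simply restates the loop from Example 5-41 so that the two versions can be compared.

```c
/* Reference form: y[i] is reloaded from memory on every iteration. */
void iir_ref(const short x[], short y[], short c1, short c2, short c3, int n)
{
    for (int i = 0; i < n; i++)
        y[i + 1] = (short)((c1 * x[i] + c2 * x[i + 1] + c3 * y[i]) >> 15);
}

/* Carried form: y[0] is loaded once; afterwards the freshly computed
   value is reused directly, which removes the load from the carry path. */
void iir_carried(const short x[], short y[], short c1, short c2, short c3, int n)
{
    short yi = y[0];                       /* single load outside the loop */
    for (int i = 0; i < n; i++) {
        /* Truncating to short mirrors MPY reading only the low 16 bits. */
        yi = (short)((c1 * x[i] + c2 * x[i + 1] + c3 * yi) >> 15);
        y[i + 1] = yi;                     /* store only; no reload */
    }
}
```

Both functions need x and y to hold at least n + 1 elements and should produce identical output arrays.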


Figure 5-16. Dependency Graph of IIR Filter (With Smaller Loop Carry)

[Figure: dependency graph split into A side and B side.]

Note: The shaded numbers show the loop carry path: 2 + 1 + 1 = 4.
5.7.4.2 New ’C6x Instructions (Inner Loop)


Example 5—43 shows the new linear assembly from the graph in Figure 5-16, 
where LDH yi was removed. The one variable y that is read and written is yi 
for the MPY p2 instruction and yi+1 for the SHR and STH instructions. 


Example 5—43. Linear Assembly for IIR Inner Loop With Reduced Loop Carry Path 


        LDH     *xptr++,xi      ; xi
        MPY     c1,xi,p0        ; c1 * xi
        LDH     *xptr,xi+1      ; xi+1
        MPY     c2,xi+1,p1      ; c2 * xi+1
        ADD     p0,p1,s0        ; c1 * xi + c2 * xi+1
        MPY     c3,y,p2         ; c3 * yi
        ADD     s0,p2,s1        ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR     s1,15,y         ; yi+1
        STH     y,*yptr++       ; store yi+1
 [cntr] SUB     cntr,1,cntr     ; decrement loop counter
 [cntr] B       LOOP            ; branch to loop


5.7.5 Linear Assembly Resource Allocation 


Example 5-44 shows the same linear assembly instructions as those in 
Example 5-43 with the functional units and registers assigned. 


Example 5-44. Linear Assembly for IIR Inner Loop (With Allocated Resources) 


        LDH     .D1     *A4++,A2        ; xi
        MPY     .M1     A6,A2,A5        ; c1 * xi
        LDH     .D1     *A4,A3          ; xi+1
        MPY     .M1X    B6,A3,A7        ; c2 * xi+1
        ADD     .L1     A5,A7,A9        ; c1 * xi + c2 * xi+1
        MPY     .M2X    A8,B2,B3        ; c3 * yi
        ADD     .L2X    B3,A9,B5        ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR     .S2     B5,15,B2        ; yi+1
        STH     .D2     B2,*B4++        ; store yi+1
 [A1]   SUB     .L1     A1,1,A1         ; decrement loop counter
 [A1]   B       .S1     LOOP            ; branch to loop
5.7.6 Modulo Iteration Interval Scheduling


Table 5—17 shows the modulo iteration interval table for the IIR filter. The SHR 
instruction on cycle 10 finishes in time for the MPY p2 instruction from the next 
iteration to read its result on cycle 11. 


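Table 5-17. Modulo Iteration Interval Table for IIR (4-Cycle Loop)

Unit/Cycle | 0 | 4 | 8, 12, 16, ...
.D1 | LDH xi | * LDH xi | ** LDH xi
.D2 | | |
.M1 | | |
.M2 | | |
.L1 | | | ADD s0
.L2 | | |
.S1 | | |
.S2 | | |
1X | | |
2X | | |

Unit/Cycle | 1 | 5 | 9, 13, 17, ...
.D1 | LDH xi+1 | * LDH xi+1 | ** LDH xi+1
.D2 | | |
.M1 | | MPY p0 | * MPY p0
.M2 | | |
.L1 | | SUB cntr | * SUB cntr
.L2 | | | ADD s1
.S1 | | |
.S2 | | |
1X | | |
2X | | | ADD s1

Unit/Cycle | 2 | 6 | 10, 14, 18, ...
.D1 | | |
.D2 | | |
.M1 | | MPY p1 | * MPY p1
.M2 | | |
.L1 | | |
.L2 | | |
.S1 | | B LOOP | * B LOOP
.S2 | | | SHR yi+1
1X | | MPY p1 | * MPY p1
2X | | |

Unit/Cycle | 3 | 7 | 11, 15, 19, ...
.D1 | | |
.D2 | | | STH yi+1
.M1 | | |
.M2 | | MPY p2 | * MPY p2
.L1 | | |
.L2 | | |
.S1 | | |
.S2 | | |
1X | | |
2X | | MPY p2 | * MPY p2

Note: The asterisks indicate the iteration of the loop.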
5.7.7 Using the Assembly Optimizer for the IIR Filter


Example 5-45 shows the linear assembly code to perform the IIR filter. Once
again, you can use this code as input to the assembly optimizer to create a
software-pipelined loop instead of scheduling this by hand.


Example 5—45. Linear Assembly for IIR Filter 


        .global _iir
_iir:   .cproc  x, y, c1, c2, c3
        .reg    xi, xi1, yi1
        .reg    p0, p1, p2, s0, s1, cntr
        MVK     100,cntr                ; cntr = 100
        LDH     .D2     *y++,yi1        ; yi+1
LOOP:   .trip 100
        LDH     .D1     *x++,xi         ; xi
        MPY     .M1     c1,xi,p0        ; c1 * xi
        LDH     .D1     *x,xi1          ; xi+1
        MPY     .M1X    c2,xi1,p1       ; c2 * xi+1
        ADD     .L1     p0,p1,s0        ; c1 * xi + c2 * xi+1
        MPY     .M2X    c3,yi1,p2       ; c3 * yi
        ADD     .L2X    s0,p2,s1        ; c1 * xi + c2 * xi+1 + c3 * yi
        SHR     .S2     s1,15,yi1       ; yi+1
        STH     .D2     yi1,*y++        ; store yi+1
 [cntr] SUB     .L1     cntr,1,cntr     ; decrement loop counter
 [cntr] B       .S1     LOOP            ; branch to loop
        .endproc
5.7.8 Final Assembly


Example 5-46 shows the final assembly for the IIR filter. With one load of y[0]
outside the loop, no other loads from the y array are needed. Example 5-46
requires 408 cycles: (4 x 100) + 8.


Example 5-46. Assembly Code for IIR Filter 


           LDH     .D1     *A4++,A2        ; xi

           LDH     .D1     *A4,A3          ; xi+1

           LDH     .D2     *B4++,B2        ; load y[0] outside of loop

           MVK     .S1     100,A1          ; set up loop counter

           LDH     .D1     *A4++,A2        ;* xi

   [A1]    SUB     .L1     A1,1,A1         ; decrement loop counter
||         MPY     .M1     A6,A2,A5        ; c1 * xi
||         LDH     .D1     *A4,A3          ;* xi+1

           MPY     .M1X    B6,A3,A7        ; c2 * xi+1
|| [A1]    B       .S1     LOOP            ; branch to loop

           MPY     .M2X    A8,B2,B3        ; c3 * yi

LOOP:
           ADD     .L1     A5,A7,A9        ; c1 * xi + c2 * xi+1
||         LDH     .D1     *A4++,A2        ;** xi

           ADD     .L2X    B3,A9,B5        ; c1 * xi + c2 * xi+1 + c3 * yi
|| [A1]    SUB     .L1     A1,1,A1         ;* decrement loop counter
||         MPY     .M1     A6,A2,A5        ;* c1 * xi
||         LDH     .D1     *A4,A3          ;** xi+1

           SHR     .S2     B5,15,B2        ; yi+1
||         MPY     .M1X    B6,A3,A7        ;* c2 * xi+1
|| [A1]    B       .S1     LOOP            ;* branch to loop

           STH     .D2     B2,*B4++        ; store yi+1
||         MPY     .M2X    A8,B2,B3        ;* c3 * yi
           ; Branch occurs here
5.8 If-Then-Else Statements in a Loop


If-then-else statements in C cause certain instructions to execute when the if
condition is true and other instructions to execute when it is false. One way to
accomplish this in linear assembly code is with conditional instructions.
Because all C6000 instructions can be conditional on one of five general-purpose
registers on the ’C62x and ’C67x and one of six on the ’C64x, conditional
instructions can handle both the true and false cases of the if-then-else C
statement.


5.8.1 If-Then-Else C Code


Example 5-47 contains a loop with an if-then-else statement. You either add
a[i] to sum or subtract a[i] from sum.


Example 5-47. If-Then-Else C Code 


int if_then(short a[], int codeword, int mask, short theta)
{
    int i, sum, cond;

    sum = 0;
    for (i = 0; i < 32; i++) {
        cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i];
        else
            sum -= a[i];
        mask = mask << 1;
    }
    return (sum);
}


Branching is one way to execute the if-then-else statement: branch to the ADD 
when the if statement is true and branch to the SUB when the if statement is 
false. However, because each branch has five delay slots, this method 
requires additional cycles. Furthermore, branching within the loop makes
software pipelining almost impossible.


Using conditional instructions, on the other hand, eliminates the need to 
branch to the appropriate piece of code after checking whether the condition 
is true or false. Simply program both the ADD and SUB as usual, but make 
them conditional on the zero and nonzero values of a condition register. This 
method also allows you to software pipeline the loop and achieve much better 
performance than you would with branching. 
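
A C-level analogue of the conditional-instruction approach is shown below as a rough sketch. The function name, the length parameter n, and the use of unsigned types for codeword and mask (to keep the left shift well defined in C) are assumptions made for this illustration; they are not part of the original example.

```c
/* Both candidate results are computed and a predicate selects one,
   much as [if] ADD and [!if] SUB are both issued in the loop. */
int if_then_predicated(const short a[], unsigned codeword, unsigned mask,
                       short theta, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        unsigned cond = codeword & mask;   /* AND                     */
        if (cond)
            cond = 1;                      /* [cond] MVK 1,cond       */
        int pred = (theta == (int)cond);   /* CMPEQ theta,cond,if     */
        int added      = sum + a[i];       /* result of [if]  ADD     */
        int subtracted = sum - a[i];       /* result of [!if] SUB     */
        sum = pred ? added : subtracted;   /* only one takes effect   */
        mask <<= 1;                        /* SHL                     */
    }
    return sum;
}
```

For a = {1, 2, 3, 4}, codeword = 5, mask = 1, and n = 4, the function returns -2 when theta is 1 and 2 when theta is 0.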
5.8.2 Translating C Code to Linear Assembly


Example 5-48 shows the linear assembly instructions needed to execute the
inner loop of the C code in Example 5-47.


Example 5—48. Linear Assembly for If-Then-Else Inner Loop 


        AND     codeword,mask,cond      ; cond = codeword & mask
 [cond] MVK     1,cond                  ; !(!(cond))
        CMPEQ   theta,cond,if           ; (theta == !(!(cond)))
        LDH     *aptr++,ai              ; a[i]
 [if]   ADD     sum,ai,sum              ; sum += a[i]
 [!if]  SUB     sum,ai,sum              ; sum -= a[i]
        SHL     mask,1,mask             ; mask = mask << 1;
 [cntr] ADD     -1,cntr,cntr            ; decrement counter
 [cntr] B       LOOP                    ; for LOOP


CMPEQ is used to create IF. The ADD is conditional when IF is nonzero
(corresponds to then); the SUB is conditional when IF is 0 (corresponds to else).


A conditional MVK performs the !(!(cond)) C statement. If the result of the 
bitwise AND is nonzero, a 1 is written into cond; if the result of the AND is 0, 
cond remains at 0. 
5.8.3 Drawing a Dependency Graph


Figure 5-17 shows the dependency graph for the if-then-else C code. This 
graph illustrates the following arrangement: 


❏ Two nodes on the graph contain sum: one for the ADD and one for the
  SUB. Because some iterations are performing an ADD and others are
  performing a SUB, each of these nodes is a possible input to the next
  iteration of either node.

❏ The LDH ai instruction is a parent of both ADD sum and SUB sum,
  because both instructions read ai.

❏ CMPEQ if is also a parent to ADD sum and SUB sum, because both read
  IF for the conditional execution.

❏ The result of SHL mask is read on the next iteration by the AND cond
  instruction.


Figure 5-17. Dependency Graph of If-Then-Else Code

[Figure: dependency graph split into A side (including the SHL node) and B side (including the AND node).]
5.8.4 Determining the Minimum Iteration Interval


With nine instructions, the minimum iteration interval is at least 2, because a
maximum of eight instructions can be in parallel. Based on the way the
dependency graph in Figure 5-17 is split, five instructions are on the A side and four
are on the B side. Because none of the instructions are MPYs, all instructions
must go on the .S, .D, or .L units, which means you have a total of six
resources.


❏ LDH must be on a .D unit.
❏ SHL, B, and MVK must be on a .S unit.
❏ The ADDs and SUB can be on the .S, .L, or .D units.
❏ The AND can be on a .S or .L unit, or .D unit (’C64x only).


From Table 5-18, you can see that no one resource is used more than two 
times, so the minimum iteration interval is still 2. 


Table 5-18. Resource Table for If-Then-Else Code

(a) A side

Unit(s) | Instructions | Total/Unit
.M1 | | 0
.S1 | SHL & B | 2
.D1 | LDH | 1
.L1, .S1, or .D1 | ADD & SUB | 2
Total non-.M units | | 5

(b) B side

Unit(s) | Instructions | Total/Unit
.M2 | | 0
.S2 | MVK | 1
.L2 | CMPEQ | 1
.L2 or .S2 | AND | 1
.L2, .S2, or .D2 | ADD | 1
Total non-.M units | | 4


The minimum iteration interval is also affected by the total number of
instructions. Because three units can perform nonmultiply operations on a given side,
a total of six nonmultiply instructions (3 units x 2 cycles) can be performed on
each side with a minimum iteration interval of 2. Because only five instructions
are on the A side and four are on the B side, the minimum iteration interval is
still 2.
5.8.5 Linear Assembly Resource Allocation


Now that the graph is split and you know the minimum iteration interval, you 
can allocate functional units and registers to the instructions. You must ensure 
that no resource is used more than twice. 


Example 5—49 shows the linear assembly with the functional units and regis- 
ters that are used in the inner loop. 


Example 5-49. Linear Assembly for Full If-Then-Else Code 


          .global _if_then
_if_then: .cproc  a, cword, mask, theta
          .reg    cond, if, ai, sum, cntr
          MVK     32,cntr                   ; cntr = 32
          ZERO    sum                       ; sum = 0
LOOP:     .trip 32
          AND     .S2X    cword,mask,cond   ; cond = codeword & mask
 [cond]   MVK     .S2     1,cond            ; !(!(cond))
          CMPEQ   .L2     theta,cond,if     ; (theta == !(!(cond)))
          LDH     .D1     *a++,ai           ; a[i]
 [if]     ADD     .L1     sum,ai,sum        ; sum += a[i]
 [!if]    SUB     .D1     sum,ai,sum        ; sum -= a[i]
          SHL     .S1     mask,1,mask       ; mask = mask << 1;
 [cntr]   ADD     .L2     -1,cntr,cntr      ; decrement counter
 [cntr]   B       .S1     LOOP              ; for LOOP
          .return sum
          .endproc
5.8.6 Final Assembly


Example 5-50 shows the final assembly code after software pipelining. The 
performance of this loop is 70 cycles (2 x 32 +6). 


Example 5-50. Assembly Code for If-Then-Else 


           MVK     .S2     32,B0           ; set up loop counter

   [B0]    ADD     .L2     -1,B0,B0        ; decrement counter

   [B0]    ADD     .L2     -1,B0,B0        ; decrement counter
|| [B0]    B       .S1     LOOP            ; for LOOP
||         LDH     .D1     *A4++,A5        ; a[i]

           SHL     .S1     A6,1,A6         ; mask = mask << 1;
||         AND     .S2X    B4,A6,B2        ; cond = codeword & mask

   [B2]    MVK     .S2     1,B2            ; !(!(cond))
|| [B0]    ADD     .L2     -1,B0,B0        ; decrement counter
|| [B0]    B       .S1     LOOP            ;* for LOOP
||         LDH     .D1     *A4++,A5        ;* a[i]

           CMPEQ   .L2     B6,B2,B1        ; (theta == !(!(cond)))
||         SHL     .S1     A6,1,A6         ;* mask = mask << 1;
||         AND     .S2X    B4,A6,B2        ;* cond = codeword & mask
||         ZERO    .L1     A7              ; zero out accumulator

LOOP:
   [B0]    ADD     .L2     -1,B0,B0        ; decrement counter
|| [B2]    MVK     .S2     1,B2            ;* !(!(cond))
|| [B0]    B       .S1     LOOP            ;** for LOOP
||         LDH     .D1     *A4++,A5        ;** a[i]

   [B1]    ADD     .L1     A7,A5,A7        ; sum += a[i]
|| [!B1]   SUB     .D1     A7,A5,A7        ; sum -= a[i]
||         CMPEQ   .L2     B6,B2,B1        ;* (theta == !(!(cond)))
||         SHL     .S1     A6,1,A6         ;** mask = mask << 1;
||         AND     .S2X    B4,A6,B2        ;** cond = codeword & mask
           ; Branch occurs here
5.8.7 Comparing Performance


You can improve the performance of the code in Example 5-50 if you know
that the loop count is at least 3. If the loop count is at least 3, remove the
decrement counter instructions outside the loop and put the MVK (for setting up the
loop counter) in parallel with the first branch. These two changes save two
cycles at the beginning of the loop prolog.


The first two branches are now unconditional, because the loop count is at 
least 3 and you know that the first two branches must execute. To account for 
the removal of the three decrement-loop-counter instructions, set the loop 
counter to 3 fewer than the actual number of times you want the loop to 
execute: in this case, 29 (32 — 3). 


Example 5-51. Assembly Code for If-Then-Else With Loop Count Greater Than 3 


           B       .S1     LOOP            ; for LOOP
||         LDH     .D1     *A4++,A5        ; a[i]
||         MVK     .S2     29,B0           ; set up loop counter

           SHL     .S1     A6,1,A6         ; mask = mask << 1;
||         AND     .S2X    B4,A6,B2        ; cond = codeword & mask

   [B2]    MVK     .S2     1,B2            ; !(!(cond))
||         B       .S1     LOOP            ;* for LOOP
||         LDH     .D1     *A4++,A5        ;* a[i]

           CMPEQ   .L2     B6,B2,B1        ; (theta == !(!(cond)))
||         SHL     .S1     A6,1,A6         ;* mask = mask << 1;
||         AND     .S2X    B4,A6,B2        ;* cond = codeword & mask
||         ZERO    .L1     A7              ; zero out accumulator

LOOP:
   [B0]    ADD     .L2     -1,B0,B0        ; decrement counter
|| [B2]    MVK     .S2     1,B2            ;* !(!(cond))
|| [B0]    B       .S1     LOOP            ;** for LOOP
||         LDH     .D1     *A4++,A5        ;** a[i]

   [B1]    ADD     .L1     A7,A5,A7        ; sum += a[i]
|| [!B1]   SUB     .D1     A7,A5,A7        ; sum -= a[i]
||         CMPEQ   .L2     B6,B2,B1        ;* (theta == !(!(cond)))
||         SHL     .S1     A6,1,A6         ;** mask = mask << 1;
||         AND     .S2X    B4,A6,B2        ;** cond = codeword & mask
           ; Branch occurs here


Example 5-51 shows the improved loop with a cycle count of 68 (2 x 32+ 4). 
Table 5-19 compares the performance of Example 5—50 and Example 5-51. 
Table 5-19. Comparison of If-Then-Else Code Examples

Code Example | Description | Cycles | Cycle Count
Example 5-50 | If-then-else assembly code | (2 x 32) + 6 | 70
Example 5-51 | If-then-else assembly code with loop count greater than 3 | (2 x 32) + 4 | 68
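
The cycle counts in this chapter all follow the same pattern: the iteration interval times the trip count, plus a fixed overhead for the prolog. A minimal C sketch of that arithmetic, using a hypothetical helper name:

```c
/* Total cycles for a software-pipelined loop:
   (iteration interval x trip count) + overhead cycles outside the kernel. */
int total_cycles(int ii, int trip_count, int overhead)
{
    return ii * trip_count + overhead;
}
```

This reproduces 70 and 68 cycles for the two if-then-else versions, and 408 cycles for the IIR filter (4-cycle loop, 100 iterations, 8 overhead cycles).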
5.9 Loop Unrolling


Even though the performance of the previous example is good, it can be
improved. When resources are not fully used, you can improve performance by
unrolling the loop. In Example 5-51, only nine instructions execute every two
cycles. If you unroll the loop and analyze the new minimum iteration interval,
you have room to add instructions. A minimum iteration interval of 3 provides
a 25% improvement in throughput: three cycles to do two iterations, rather
than the four cycles required in Example 5-51.


5.9.1 Unrolled If-Then-Else C Code 


Example 5-52 shows the unrolled version of the if-then-else C code in 
Example 5-47 on page 5-86. 


Example 5-52. If-Then-Else C Code (Unrolled) 


int unrolled_if_then(short a[], int codeword, int mask, short theta)
{
    int i, sum, cond;

    sum = 0;
    for (i = 0; i < 32; i += 2) {
        cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i];
        else
            sum -= a[i];
        mask = mask << 1;

        cond = codeword & mask;
        if (theta == !(!(cond)))
            sum += a[i+1];
        else
            sum -= a[i+1];
        mask = mask << 1;
    }
    return (sum);
}
5.9.2 Translating C Code to Linear Assembly


Example 5-53 shows the unrolled inner loop with 16 instructions and the 
possibility of achieving a loop with a minimum iteration interval of 3. 


Example 5-53. Linear Assembly for Unrolled If-Then-Else Inner Loop 


           AND     codeword,maski,condi       ; condi = codeword & maski
 [condi]   MVK     1,condi                    ; !(!(condi))
           CMPEQ   theta,condi,ifi            ; (theta == !(!(condi)))
           LDH     *aptr++,ai                 ; a[i]
 [ifi]     ADD     sumi,ai,sumi               ; sum += a[i]
 [!ifi]    SUB     sumi,ai,sumi               ; sum -= a[i]
           SHL     maski,1,maski+1            ; maski+1 = maski << 1;
           AND     codeword,maski+1,condi+1   ; condi+1 = codeword & maski+1
 [condi+1] MVK     1,condi+1                  ; !(!(condi+1))
           CMPEQ   theta,condi+1,ifi+1        ; (theta == !(!(condi+1)))
           LDH     *aptr++,ai+1               ; a[i+1]
 [ifi+1]   ADD     sumi+1,ai+1,sumi+1         ; sum += a[i+1]
 [!ifi+1]  SUB     sumi+1,ai+1,sumi+1         ; sum -= a[i+1]
           SHL     maski+1,1,maski            ; maski = maski+1 << 1;
 [cntr]    ADD     -1,cntr,cntr               ; decrement counter
 [cntr]    B       LOOP                       ; for LOOP
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5.9.3 Drawing a Dependency Graph 


Although there are numerous ways to split the dependency graph, the main 
goal is to achieve a minimum iteration interval of 3 and meet these conditions: 


[J You cannot have more than nine non-.M instructions on either side. 
(J Only three non-.M instructions can execute per cycle. 


Figure 5-18 shows the dependency graph for the unrolled if-then-else code. 
Nine instructions are on the A side, and seven instructions are on the B side. 


Figure 5-18. Dependency Graph of If-Then-Else Code (Unrolled) 

[Figure not reproduced. The graph is split into an A side and a B side.]


5.9.4 Determining the Minimum Iteration Interval 


With 16 instructions, the minimum iteration interval is at least 3 because a 
maximum of six instructions can be in parallel with the following allocation 
possibilities: 


- LDH must be on a .D unit. 
- SHL, B, and MVK must be on a .S unit. 
- The ADDs and SUB can be on a .S, .L, or .D unit. 
- The AND can be on a .S or .L unit, or a .D unit ('C64x only). 


From Table 5-20, you can see that no one resource is used more than three 
times, so the minimum iteration interval is still 3. 


Checking the total number of non-.M instructions on each side shows that a 
total of nine instructions can be performed with the minimum iteration interval 
of 3. Because only seven non-.M instructions are on the B side, the minimum 
iteration interval is still 3. 
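
The resource check described above can be sketched in C. This is only an illustration under stated assumptions: the helper names (ceil_div, resource_mii) are hypothetical, and the model covers just the two rules the text uses, namely that an instruction tied to one unit occupies that unit once per loop cycle and that three non-.M units (.S, .L, .D) are available per side.

```c
#include <assert.h>

/* Ceiling division for a non-negative numerator and positive divisor. */
static int ceil_div(int n, int d)
{
    assert(n >= 0 && d > 0);
    return (n + d - 1) / d;
}

/*
 * Resource-bound minimum iteration interval for one side (A or B).
 *   fixed[]     : instruction counts tied to a single unit (.S, .D, or .L)
 *   n           : number of entries in fixed[]
 *   total_non_m : all non-.M instructions on that side, including the
 *                 flexible ones that may go to any of several units
 * Assumption: three non-.M units exist per side, so the total is
 * divided by 3 and rounded up.
 */
static int resource_mii(const int fixed[], int n, int total_non_m)
{
    int k;
    int mii = ceil_div(total_non_m, 3);

    for (k = 0; k < n; k++) {
        if (fixed[k] > mii)
            mii = fixed[k];
    }
    return mii;
}
```

With the counts from Table 5-20 (A side: .S1 = 3, .D1 = 2, .L1 = 1, total 9; B side: .S2 = 2, .L2 = 1, total 7), both sides evaluate to 3, which agrees with the minimum iteration interval stated in the text.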


Table 5-20. Resource Table for Unrolled If-Then-Else Code 


(a) A side                                       (b) B side

Unit(s)            Instructions     Total/Unit   Unit(s)            Instructions     Total/Unit
.M1                                 0            .M2                                 0
.S1                MVK and 2 SHLs   3            .S2                MVK and B        2
.D1                2 LDHs           2            .L2                CMPEQ            1
.L1                CMPEQ            1            .L2 or .S2         AND              1
.L1 or .S1         AND              1            .L2, .S2, or .D2   SUB and 2 ADDs   3
.L1, .S1, or .D1   ADD and SUB      2
Total non-.M units                  9            Total non-.M units                  7


5.9.5 Linear Assembly Resource Allocation 


Now that the graph is split and you know the minimum iteration interval, you 
can allocate functional units and registers to the instructions. You must ensure 
no resource is used more than three times. 


Example 5-54 shows the linear assembly code with the functional units and 
registers. 

Example 5-54. Linear Assembly for Full Unrolled If-Then-Else Code 


         .global _unrolled_if_then
_unrolled_if_then: .cproc a, cword, mask, theta
         .reg    cword, mask, theta, ifi, ifi1, a, ai, ai1, cntr
         .reg    cdi, cdi1, sumi, sumi1, sum

         MV      A4,a                  ; C callable register for 1st operand
         MV      B4,cword              ; C callable register for 2nd operand
         MV      A6,mask               ; C callable register for 3rd operand
         MV      B6,theta              ; C callable register for 4th operand
         MVK     16,cntr               ; cntr = 32/2
         ZERO    sumi                  ; sumi = 0
         ZERO    sumi1                 ; sumi+1 = 0

LOOP:    .trip 32
         AND   .L1X  cword,mask,cdi    ; cdi = codeword & maski
[cdi]    MVK   .S1   1,cdi             ; !(!(cdi))
         CMPEQ .L1X  theta,cdi,ifi     ; (theta == !(!(cdi)))
         LDH   .D1   *a++,ai           ; a[i]
[ifi]    ADD   .L1   sumi,ai,sumi      ; sum += a[i]
[!ifi]   SUB   .D1   sumi,ai,sumi      ; sum -= a[i]
         SHL   .S1   mask,1,mask       ; maski+1 = maski << 1;
         AND   .L2X  cword,mask,cdi1   ; cdi+1 = codeword & maski+1
[cdi1]   MVK   .S2   1,cdi1            ; !(!(cdi+1))
         CMPEQ .L2   theta,cdi1,ifi1   ; (theta == !(!(cdi+1)))
         LDH   .D1   *a++,ai1          ; a[i+1]
[ifi1]   ADD   .L2   sumi1,ai1,sumi1   ; sum += a[i+1]
[!ifi1]  SUB   .D2   sumi1,ai1,sumi1   ; sum -= a[i+1]
         SHL   .S1   mask,1,mask       ; maski = maski+1 << 1;
[cntr]   ADD   .D2   -1,cntr,cntr      ; decrement counter
[cntr]   B     .S2   LOOP              ; for LOOP

         ADD         sumi,sumi1,sum    ; Add sumi and sumi+1 for ret value

         .return sum
         .endproc


5.9.6 Final Assembly 


Example 5-55 shows the final assembly code after software pipelining. The 
cycle count of this loop is now 53: (3 x 16) + 5. 


Example 5-55. Assembly Code for Unrolled If-Then-Else 


         MVK   .S2   16,B0        ; set up loop counter
         LDH   .D1   *A4++,A5     ; a[i]
[B0]     ADD   .D2   -1,B0,B0     ; decrement counter
         LDH   .D1   *A4++,B5     ; a[i+1]
[B0]     B     .S2   LOOP         ; for LOOP
[B0]     ADD   .D2   -1,B0,B0     ; decrement counter
         SHL   .S1   A6,1,A6      ; maski+1 = maski << 1;
         AND   .L1X  B4,A6,A2     ; condi = codeword & maski
[A2]     MVK   .S1   1,A2         ; !(!(condi))
         AND   .L2X  B4,A6,B2     ; condi+1 = codeword & maski+1
         ZERO  .L1   A7           ; zero accumulator
[B2]     MVK   .S2   1,B2         ; !(!(condi+1))
         CMPEQ .L1X  B6,A2,A1     ; (theta == !(!(condi)))
         SHL   .S1   A6,1,A6      ; maski = maski+1 << 1;
         LDH   .D1   *A4++,A5     ;* a[i]
         ZERO  .L2   B7           ; zero accumulator
LOOP:
         CMPEQ .L2   B6,B2,B1     ; (theta == !(!(condi+1)))
[B0]     ADD   .D2   -1,B0,B0     ; decrement counter
         LDH   .D1   *A4++,B5     ;* a[i+1]
[B0]     B     .S2   LOOP         ;* for LOOP
         SHL   .S1   A6,1,A6      ;* maski+1 = maski << 1;
         AND   .L1X  B4,A6,A2     ;* condi = codeword & maski
[A1]     ADD   .L1   A7,A5,A7     ; sum += a[i]
[!A1]    SUB   .D1   A7,A5,A7     ; sum -= a[i]
[A2]     MVK   .S1   1,A2         ;* !(!(condi))
         AND   .L2X  B4,A6,B2     ;* condi+1 = codeword & maski+1
[B1]     ADD   .L2   B7,B5,B7     ; sum += a[i+1]
[!B1]    SUB   .D2   B7,B5,B7     ; sum -= a[i+1]
[B2]     MVK   .S2   1,B2         ;* !(!(condi+1))
         CMPEQ .L1X  B6,A2,A1     ;* (theta == !(!(condi)))
         SHL   .S1   A6,1,A6      ;* maski = maski+1 << 1;
         LDH   .D1   *A4++,A5     ;** a[i]
         ; Branch occurs here
         ADD   .L1X  A7,B7,A4     ; move to return register


5.9.7 Comparing Performance 


Table 5-21 compares the performance of all versions of the if-then-else code 
examples. 


Table 5-21. Comparison of If-Then-Else Code Examples 


Code Example                                                              Cycles         Cycle Count
Example 5-50  If-then-else assembly code                                  (2 x 32) + 6   70
Example 5-51  If-then-else assembly code with loop count greater than 3   (2 x 32) + 4   68
Example 5-55  Unrolled if-then-else assembly code                         (3 x 16) + 5   53
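
The cycle counts in Table 5-21 all follow the same pattern: iteration interval times trip count, plus a fixed number of cycles outside the kernel. A minimal C sketch of that arithmetic follows; the function name loop_cycles and its parameter names are illustrative only and do not come from the tools.

```c
#include <assert.h>

/*
 * Cycle count of a software-pipelined loop:
 *   iteration_interval : cycles per kernel pass
 *   trip_count         : number of kernel passes
 *   overhead           : cycles spent outside the kernel (prolog and
 *                        setup), taken directly from the table formula
 */
static int loop_cycles(int iteration_interval, int trip_count, int overhead)
{
    assert(iteration_interval > 0 && trip_count >= 0 && overhead >= 0);
    return iteration_interval * trip_count + overhead;
}
```

Calling loop_cycles(2, 32, 6), loop_cycles(2, 32, 4), and loop_cycles(3, 16, 5) gives 70, 68, and 53. Unrolling halves the trip count while the iteration interval grows only from 2 to 3, which is where the saving comes from.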
5.10 Live-Too-Long Issues 


When the result of a parent instruction is live longer than the minimum iteration 
interval of a loop, you have a live-too-long problem. Because each instruction 
executes every iteration interval cycle, the next iteration of that parent over- 
writes the register with a new value before the child can read it. Section 5.6.6.1, 
Resource Conflicts, on page 5-65 showed how to solve this problem simply 
by moving the parent to a later cycle. This is not always a valid solution. 


5.10.1 C Code With Live-Too-Long Problem 


Example 5-56 shows C code with a live-too-long problem that cannot be 
solved by rescheduling the parent instruction. Although it is not obvious from 
the C code, the dependency graph in Figure 5-19 on page 5-103 shows a split- 
join path that causes this live-too-long problem. 


Example 5-56. Live-Too-Long C Code 


int live_long(short a[], short b[], short c, short d, short e)
{
    int i, sum0, sum1, sum, a0, a2, a3, b0, b2, b3;
    short a1, b1;

    sum0 = 0;
    sum1 = 0;
    for (i = 0; i < 100; i++) {
        a0 = a[i] * c;
        a1 = a0 >> 15;
        a2 = a1 * d;
        a3 = a2 + a0;
        sum0 += a3;
        b0 = b[i] * c;
        b1 = b0 >> 15;
        b2 = b1 * e;
        b3 = b2 + b0;
        sum1 += b3;
    }
    sum = sum0 + sum1;
    return(sum);
}


5.10.2 Translating C Code to Linear Assembly 


Example 5-57 shows the assembly instructions that execute the inner loop in 
Example 5-56. 


Example 5-57. Linear Assembly for Live-Too-Long Inner Loop 


         LDH     *aptr++,ai        ; load ai from memory
         LDH     *bptr++,bi        ; load bi from memory
         MPY     ai,c,a0           ; a0 = ai * c
         SHR     a0,15,a1          ; a1 = a0 >> 15
         MPY     a1,d,a2           ; a2 = a1 * d
         ADD     a2,a0,a3          ; a3 = a2 + a0
         ADD     sum0,a3,sum0      ; sum0 += a3
         MPY     bi,c,b0           ; b0 = bi * c
         SHR     b0,15,b1          ; b1 = b0 >> 15
         MPY     b1,e,b2           ; b2 = b1 * e
         ADD     b2,b0,b3          ; b3 = b2 + b0
         ADD     sum1,b3,sum1      ; sum1 += b3
[cntr]   SUB     cntr,1,cntr       ; decrement loop counter
[cntr]   B       LOOP              ; branch to loop


5.10.3 Drawing a Dependency Graph 


Figure 5-19 shows the dependency graph for the live-too-long code. This 
algorithm includes three separate and independent graphs. Two of the 
independent graphs have split-join paths: from a0 to a3 and from b0 to b3. 

Figure 5-19. Dependency Graph of Live-Too-Long Code 

[Figure not reproduced. It marks a split-join path on each of the A and B sides.]


5.10.4 Determining the Minimum Iteration Interval 


Table 5-22 shows the functional unit resources for the loop. Based on the re- 
source usage, the minimum iteration interval is 2 for the following reasons: 


- No specific resource is used more than twice, implying a minimum 
  iteration interval of 2. 

- A total of five non-.M units on each side also implies a minimum iteration 
  interval of 2, because three non-.M units can be used on a side during each 
  cycle. 


Table 5-22. Resource Table for Live-Too-Long Code 


(a) A side                                       (b) B side

Unit(s)            Instructions     Total/Unit   Unit(s)            Instructions     Total/Unit
.M1                MPY              1            .M2                MPY              1
.S1                B and SHR        2            .S2                SHR              1
.D1                LDH              1            .D2                LDH              1
.L1, .S1, or .D1   2 ADDs           2            .L2, .S2, or .D2   2 ADDs and SUB   3
Total non-.M units                  5            Total non-.M units                  5


However, the minimum iteration interval is determined by both resources and 
data dependency. A loop carry path determined the minimum iteration interval 
of the IIR filter in section 5.7, Loop Carry Paths, on page 5-77. In this example, 
a live-too-long problem determines the minimum iteration interval. 


5.10.4.1 Split-Join-Path Problems 

In Figure 5-19, the two split-join paths from a0 to a3 and from b0 to b3 create 
the live-too-long problem. Because the ADD a3 instruction cannot be scheduled 
until the SHR a1 and MPY a2 instructions finish, a0 must be live for at least 
four cycles. For example: 


- If MPY a0 is scheduled on cycle 5, then the earliest SHR a1 can be 
  scheduled is cycle 7. 

- The earliest MPY a2 can be scheduled is cycle 8. 

- The earliest ADD a3 can be scheduled is cycle 10. 


Because a0 is written at the end of cycle 6, it must be live from cycle 7 to 
cycle 10, or four cycles. No value can be live longer than the minimum iteration 
interval, because the next iteration of the loop will overwrite that value before 
the current iteration can read the value. Therefore, if the value has to be live 
for four cycles, the minimum iteration interval must be at least 4. A minimum 
iteration interval of 4 means that the loop executes at half the performance that 
it could based on available resources. 
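
The schedule walked through above can be reproduced with a few lines of C. This is a sketch, not part of the tools: the helper names are hypothetical, and the latencies are the ones implied by the example (an MPY result is readable two cycles after issue, an SHR result one cycle after issue).

```c
#include <assert.h>

/* Result latencies in cycles, as implied by the example above. */
enum { LAT_MPY = 2, LAT_SHR = 1 };

/* Earliest cycle on which a child can issue, given the cycle on
 * which its parent issued and the parent's result latency. */
static int earliest_child(int parent_cycle, int parent_latency)
{
    assert(parent_latency > 0);
    return parent_cycle + parent_latency;
}

/* Number of cycles a value is live: from the first cycle it can be
 * read through the cycle on which its last reader issues. */
static int live_cycles(int producer_cycle, int producer_latency,
                       int last_reader_cycle)
{
    int first_readable = producer_cycle + producer_latency;

    assert(last_reader_cycle >= first_readable);
    return last_reader_cycle - first_readable + 1;
}
```

Starting from MPY a0 on cycle 5, chaining earliest_child through SHR a1 and MPY a2 gives cycles 7, 8, and 10, and live_cycles(5, LAT_MPY, 10) returns 4. Since no value may live longer than the iteration interval, that 4 is the lower bound the text arrives at.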


5.10.4.2 Unrolling the Loop 


One way to solve this problem is to unroll the loop, so that you are doing twice 
as much work in each iteration. After unrolling, the minimum iteration interval 
is 4, based on both the resources and the data dependencies of the split-join 
path. Although unrolling the loop allows you to achieve the highest possible 
loop throughput, unrolling the loop does increase the code size. 


5.10.4.3 Inserting Moves 


Another solution to the live-too-long problem is to break up the lifetime of a0 
and b0 by inserting move (MV) instructions. The MV instruction breaks up the 
left path of the split-join path into two smaller pieces. 


5.10.4.4 Drawing a New Dependency Graph 


Figure 5-20 shows the new dependency graph with the MV instructions. Now 
the left paths of the split-join paths are broken into two pieces. Each value, a0 
and a0’, can be live for minimum iteration interval cycles. If MPY a0 is sched- 
uled on cycle 5 and ADD a3 is scheduled on cycle 10, you can achieve a mini- 
mum iteration interval of 2 by scheduling MV a0’ on cycle 8. Then a0 is live on 
cycles 7 and 8, and a0’ is live on cycles 9 and 10. Because no values are live 
more than two cycles, the minimum iteration interval for this graph is 2. 
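
The placement of the MV can be worked out mechanically. The C sketch below generalizes the single case in the text and is an assumption-laden illustration: the function names are hypothetical, MV is assumed to have single-cycle latency, and each move is placed on the last cycle of the piece it terminates.

```c
#include <assert.h>

/* Number of MV instructions needed so that no piece of the live
 * range [first, last] is longer than ii cycles. */
static int moves_needed(int first, int last, int ii)
{
    int len = last - first + 1;

    assert(ii > 0 && len > 0);
    return (len + ii - 1) / ii - 1;
}

/* Issue cycle of the k-th move (k = 1, 2, ...): the last cycle of
 * the k-th piece. With single-cycle MV latency (an assumption), the
 * copy becomes readable on the following cycle. */
static int move_cycle(int first, int ii, int k)
{
    assert(ii > 0 && k >= 1);
    return first + k * ii - 1;
}
```

For the a0 range of cycles 7 through 10 and a target iteration interval of 2, moves_needed(7, 10, 2) is 1 and move_cycle(7, 2, 1) is 8, so a0 is live on cycles 7 and 8 and the copy on cycles 9 and 10, as described above.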
Figure 5-20. Dependency Graph of Live-Too-Long Code (Split-Join Path Resolved) 

[Figure not reproduced. Nodes on each side (A and B): LDH, MPY, SHR, MV, MPY, ADD, ADD.]


5.10.5 Linear Assembly Resource Allocation 

Example 5-58 shows the linear assembly code with the functional units 
assigned. The choice of units for the ADDs and SUB is flexible and represents 
one of a number of possibilities. One goal is to ensure that no functional unit 
is used more than the minimum iteration interval, or two times. 


The two 2X paths and one 1X path are required because the values c, d, and 
e reside on the side opposite from the instruction that is reading them. If these 
values had created a bottleneck of resources and caused the minimum itera- 
tion interval to increase, c, d, and e could have been loaded into the opposite 
register file outside the loop to eliminate the cross path. 


Example 5-58. Linear Assembly for Full Live-Too-Long Code 


         .global _live_long
_live_long: .cproc a, b, c, d, e
         .reg    ai, bi, sum0, sum1, sum
         .reg    a0p, a_0, a_1, a_2, a_3, b_0, b0p, b_1, b_2, b_3, cntr

         MVK     100,cntr              ; cntr = 100
         ZERO    sum0                  ; sum0 = 0
         ZERO    sum1                  ; sum1 = 0

LOOP:    .trip 100
         LDH   .D1   *a++,ai           ; load ai from memory
         LDH   .D2   *b++,bi           ; load bi from memory
         MPY   .M1   ai,c,a_0          ; a0 = ai * c
         SHR   .S1   a_0,15,a_1        ; a1 = a0 >> 15
         MPY   .M1X  a_1,d,a_2         ; a2 = a1 * d
         MV    .D1   a_0,a0p           ; save a0 across iterations
         ADD   .L1   a_2,a0p,a_3       ; a3 = a2 + a0
         ADD   .L1   sum0,a_3,sum0     ; sum0 += a3
         MPY   .M2X  bi,c,b_0          ; b0 = bi * c
         SHR   .S2   b_0,15,b_1        ; b1 = b0 >> 15
         MPY   .M2X  b_1,e,b_2         ; b2 = b1 * e
         MV    .D2   b_0,b0p           ; save b0 across iterations
         ADD   .L2   b_2,b0p,b_3       ; b3 = b2 + b0
         ADD   .L2   sum1,b_3,sum1     ; sum1 += b3
[cntr]   SUB   .S2   cntr,1,cntr       ; decrement loop counter
[cntr]   B     .S1   LOOP              ; branch to loop

         ADD         sum0,sum1,sum     ; add sum0 and sum1 for ret value

         .return sum
         .endproc


5.10.6 Final Assembly With Move Instructions 


Example 5-59 shows the final assembly code after software pipelining. The 
performance of this loop is 212 cycles (2 x 100 + 11 + 1). 


Example 5-59. Assembly Code for Live-Too-Long With Move Instructions 


         LDH   .D1   *A4++,A0     ; load ai from memory
||       LDH   .D2   *B4++,B0     ; load bi from memory
         MVK   .S2   100,B2       ; set up loop counter
         LDH   .D1   *A4++,A0     ;* load ai from memory
||       LDH   .D2   *B4++,B0     ;* load bi from memory
         ZERO  .S1   A1           ; zero out accumulator
||       ZERO  .S2   B1           ; zero out accumulator
         LDH   .D1   *A4++,A0     ;** load ai from memory
||       LDH   .D2   *B4++,B0     ;** load bi from memory
[B2]     SUB   .S2   B2,1,B2      ; decrement loop counter
         MPY   .M1   A0,A6,A3     ; a0 = ai * c
         MPY   .M2X  B0,A6,B10    ; b0 = bi * c
         LDH   .D1   *A4++,A0     ;*** load ai from memory
         LDH   .D2   *B4++,B0     ;*** load bi from memory
[B2]     SUB   .S2   B2,1,B2      ; decrement loop counter
[B2]     B     .S1   LOOP         ; branch to loop
         SHR   .S1   A3,15,A5     ; a1 = a0 >> 15
         SHR   .S2   B10,15,B5    ; b1 = b0 >> 15
         MPY   .M1   A0,A6,A3     ;* a0 = ai * c
         MPY   .M2X  B0,A6,B10    ;* b0 = bi * c
         LDH   .D1   *A4++,A0     ;**** load ai from memory
         LDH   .D2   *B4++,B0     ;**** load bi from memory
         MPY   .M1X  A5,B6,A7     ; a2 = a1 * d
         MV    .D1   A3,A2        ; save a0 across iterations
         MPY   .M2X  B5,A8,B7     ; b2 = b1 * e
         MV    .D2   B10,B8       ; save b0 across iterations
[B2]     SUB   .S2   B2,1,B2      ;* decrement loop counter
[B2]     B     .S1   LOOP         ;* branch to loop
         SHR   .S1   A3,15,A5     ;* a1 = a0 >> 15
         SHR   .S2   B10,15,B5    ;* b1 = b0 >> 15
         MPY   .M1   A0,A6,A3     ;** a0 = ai * c
         MPY   .M2X  B0,A6,B10    ;** b0 = bi * c
         LDH   .D1   *A4++,A0     ;***** load ai from memory
         LDH   .D2   *B4++,B0     ;***** load bi from memory
LOOP:
         ADD   .L1   A7,A2,A9     ;* a3 = a2 + a0
||       ADD   .L2   B7,B8,B9     ;* b3 = b2 + b0
||       MPY   .M1X  A5,B6,A7     ;* a2 = a1 * d
||       MV    .D1   A3,A2        ;* save a0 across iterations
||       MPY   .M2X  B5,A8,B7     ;* b2 = b1 * e
||       MV    .D2   B10,B8       ;* save b0 across iterations
|| [B2]  SUB   .S2   B2,1,B2      ;** decrement loop counter
|| [B2]  B     .S1   LOOP         ;** branch to loop

         ADD   .L1   A1,A9,A1     ; sum0 += a3
||       ADD   .L2   B1,B9,B1     ; sum1 += b3
||       SHR   .S1   A3,15,A5     ;** a1 = a0 >> 15
||       SHR   .S2   B10,15,B5    ;** b1 = b0 >> 15
||       MPY   .M1   A0,A6,A3     ;*** a0 = ai * c
||       MPY   .M2X  B0,A6,B10    ;*** b0 = bi * c
||       LDH   .D1   *A4++,A0     ;****** load ai from memory
||       LDH   .D2   *B4++,B0     ;****** load bi from memory
         ; Branch occurs here

         ADD   .L1X  A1,B1,A4     ; sum = sum0 + sum1


5.11 Redundant Load Elimination 


Filter algorithms typically read the same value from memory multiple times and 
are, therefore, prime candidates for optimization by eliminating redundant load 
instructions. Rather than perform a load operation each time a particular value 
is read, you can keep the value in a register and read the register multiple 
times. 


5.11.1 FIR Filter C Code 


Example 5-60 shows C code for a simple FIR filter. There are two memory 
reads (x[i+j] and h[i]) for each multiply. Because the ‘C6000 can perform only 
two LDHs per cycle, it seems, at first glance, that only one multiply-accumulate 
per cycle is possible. 


Example 5-60. FIR Filter C Code 


void fir(short x[], short h[], short y[])
{
    int i, j, sum;

    for (j = 0; j < 100; j++) {
        sum = 0;
        for (i = 0; i < 32; i++)
            sum += x[i+j] * h[i];
        y[j] = sum >> 15;
    }
}


One way to optimize this situation is to perform LDWs instead of LDHs to read 
two data values at a time. Although using LDW works for the h array, the x array 
presents a different problem because the 'C6x does not allow you to load 
values across a word boundary. 


For example, on the first outer loop (j = 0), you can read the x-array elements 
(0 and 1, 2 and 3, etc.) as long as elements 0 and 1 are aligned on a 4-byte 
word boundary. However, the second outer loop (j = 1) requires reading x-array 
elements 1 through 32. The LDW operation must load elements that are not 
word-aligned (1 and 2, 3 and 4, etc.). 


5.11.1.1 Redundant Loads 

In order to achieve two multiply-accumulates per cycle, you must reduce the 
number of LDHs. Because successive outer loops read all the same h-array 
values and almost all of the same x-array values, you can eliminate the 
redundant loads by unrolling the inner and outer loops. 


For example, x[1] is needed for the first outer loop (x[j+1] with j = 0) and for the 
second outer loop (x[j] with j = 1). You can use a single LDH instruction to load 
this value. 


5.11.1.2 New FIR Filter C Code 


Example 5-61 shows that after eliminating redundant loads, there are four 
memory-read operations for every four multiply-accumulate operations. Now 
the memory accesses no longer limit the performance. 


Example 5-61. FIR Filter C Code With Redundant Load Elimination 


void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0, x1, h0, h1;

    for (j = 0; j < 100; j += 2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i += 2) {
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x0 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x0 * h1;
        }
        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}
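
One way to gain confidence in a transformation like this is to run the original and the unrolled forms on the same data and compare the outputs. The sketch below does that on a host compiler. The function names, the NY and NH constants, and the test pattern are illustrative choices, not part of the original examples; the input array is sized so that the highest index either version reads (x[130]) is in bounds.

```c
#include <assert.h>
#include <string.h>

#define NY 100   /* number of output samples */
#define NH 32    /* number of filter taps    */

/* Straightforward form: two loads per multiply. */
static void fir_ref(const short x[], const short h[], short y[])
{
    int i, j, sum;

    for (j = 0; j < NY; j++) {
        sum = 0;
        for (i = 0; i < NH; i++)
            sum += x[i + j] * h[i];
        y[j] = (short)(sum >> 15);
    }
}

/* Unrolled form: each loaded x and h value is used twice. */
static void fir_rle(const short x[], const short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0, x1, h0, h1;

    assert(NY % 2 == 0 && NH % 2 == 0);
    for (j = 0; j < NY; j += 2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < NH; i += 2) {
            x1 = x[j + i + 1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x0 = x[j + i + 2];
            h1 = h[i + 1];
            sum0 += x1 * h1;
            sum1 += x0 * h1;
        }
        y[j] = (short)(sum0 >> 15);
        y[j + 1] = (short)(sum1 >> 15);
    }
}

/* Returns 1 when both versions produce identical output for a
 * deterministic test pattern, 0 otherwise. */
static int fir_outputs_match(void)
{
    short x[NY + NH + 2], h[NH], y0[NY], y1[NY];
    int k;

    for (k = 0; k < NY + NH + 2; k++)
        x[k] = (short)((k * 37) % 201 - 100);
    for (k = 0; k < NH; k++)
        h[k] = (short)((k * 53) % 127 - 63);

    fir_ref(x, h, y0);
    fir_rle(x, h, y1);
    return memcmp(y0, y1, sizeof y0) == 0;
}
```

Per pass of the unrolled inner loop there are four loads (x1, h0, x0, h1) feeding four multiply-accumulates, compared with eight loads for the same work in the straightforward form.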
5.11.2 Translating C Code to Linear Assembly 


Example 5-62 shows the linear assembly that performs the inner loop of the 
FIR filter C code. 


Element x0 is read by the MPY p00 before it is loaded by the LDH x0 
instruction; x[j] (the first x0) is loaded outside the loop, but successive even 
elements are loaded inside the loop. 


Example 5-62. Linear Assembly for FIR Inner Loop 


         LDH   .D2   *x_1++[2],x1      ; x1 = x[j+i+1]
         LDH   .D1   *h++[2],h0        ; h0 = h[i]
         MPY   .M1   x0,h0,p00         ; x0 * h0
         MPY   .M1X  x1,h0,p10         ; x1 * h0
         ADD   .L1   p00,sum0,sum0     ; sum0 += x0 * h0
         ADD   .L2X  p10,sum1,sum1     ; sum1 += x1 * h0
         LDH   .D1   *x++[2],x0        ; x0 = x[j+i+2]
         LDH   .D2   *h_1++[2],h1      ; h1 = h[i+1]
         MPY   .M2   x1,h1,p01         ; x1 * h1
         MPY   .M2X  x0,h1,p11         ; x0 * h1
         ADD   .L1X  p01,sum0,sum0     ; sum0 += x1 * h1
         ADD   .L2   p11,sum1,sum1     ; sum1 += x0 * h1
[ctr]    SUB   .S2   ctr,1,ctr         ; decrement loop counter
[ctr]    B     .S2   LOOP              ; branch to loop
5.11.3 Drawing a Dependency Graph 


Figure 5-21 shows the dependency graph of the FIR filter with redundant load 
elimination. 


Figure 5-21. Dependency Graph of FIR Filter (With Redundant Load Elimination) 

[Figure not reproduced. The graph is split into an A side and a B side.]


5.11.4 Determining the Minimum Iteration Interval 


Table 5-23 shows that the minimum iteration interval is 2. An iteration interval 
of 2 means that two multiply-accumulates are executing per cycle. 


Table 5-23. Resource Table for FIR Filter Code 


(a) A side                                       (b) B side

Unit(s)            Instructions     Total/Unit   Unit(s)            Instructions     Total/Unit
.M1                2 MPYs           2            .M2                2 MPYs           2
.S1                                 0            .S2                B                1
.D1                2 LDHs           2            .D2                2 LDHs           2
.L1, .S1, or .D1   2 ADDs           2            .L2, .S2, or .D2   2 ADDs and SUB   3
Total non-.M units                  4            Total non-.M units                  6
1X paths                            2            2X paths                            2


5.11.5 Linear Assembly Resource Allocation 


Example 5-63 shows the linear assembly with functional units and registers 
assigned. 


Example 5-63. Linear Assembly for Full FIR Code 


         .global _fir
_fir:    .cproc x, h, y
         .reg    x_1, h_1, sum0, sum1, ctr, octr
         .reg    p00, p01, p10, p11, x0, x1, h0, h1, rstx, rsth

         ADD     h,2,h_1               ; set up pointer to h[1]
         MVK     50,octr               ; outer loop ctr = 100/2
         MVK     64,rstx               ; used to rst x pointer each outer loop
         MVK     64,rsth               ; used to rst h pointer each outer loop
OUTLOOP:
         ADD     x,2,x_1               ; set up pointer to x[j+1]
         SUB     h_1,2,h               ; set up pointer to h[0]
         MVK     16,ctr                ; inner loop ctr = 32/2
         ZERO    sum0                  ; sum0 = 0
         ZERO    sum1                  ; sum1 = 0
[octr]   SUB     octr,1,octr           ; decrement outer loop counter
         LDH   .D1   *x++[2],x0        ; x0 = x[j]

LOOP:    .trip 16
         LDH   .D2   *x_1++[2],x1      ; x1 = x[j+i+1]
         LDH   .D1   *h++[2],h0        ; h0 = h[i]
         MPY   .M1   x0,h0,p00         ; x0 * h0
         MPY   .M1X  x1,h0,p10         ; x1 * h0
         ADD   .L1   p00,sum0,sum0     ; sum0 += x0 * h0
         ADD   .L2X  p10,sum1,sum1     ; sum1 += x1 * h0
         LDH   .D1   *x++[2],x0        ; x0 = x[j+i+2]
         LDH   .D2   *h_1++[2],h1      ; h1 = h[i+1]
         MPY   .M2   x1,h1,p01         ; x1 * h1
         MPY   .M2X  x0,h1,p11         ; x0 * h1
         ADD   .L1X  p01,sum0,sum0     ; sum0 += x1 * h1
         ADD   .L2   p11,sum1,sum1     ; sum1 += x0 * h1
[ctr]    SUB   .S2   ctr,1,ctr         ; decrement loop counter
[ctr]    B     .S2   LOOP              ; branch to loop

         SHR     sum0,15,sum0          ; sum0 >> 15
         SHR     sum1,15,sum1          ; sum1 >> 15
         STH     sum0,*y++             ; y[j] = sum0 >> 15
         STH     sum1,*y++             ; y[j+1] = sum1 >> 15
         SUB     x,rstx,x              ; reset x pointer to x[j]
         SUB     h_1,rsth,h_1          ; reset h pointer to h[0]
[octr]   B       OUTLOOP               ; branch to outer loop


5.11.6 Final Assembly 


Example 5-64 shows the final assembly for the FIR filter without redundant 
load instructions. At the end of the inner loop is a branch to OUTLOOP that 
executes the next outer loop. The outer loop counter is 50 because iterations 
j and j + 1 execute each time the inner loop is run. The inner loop counter is 
16 because iterations i and i + 1 execute each inner loop iteration. 


The cycle count for this nested loop is 2352 cycles: 50 x (16 x 2 + 9 + 6) + 2. 
Fifteen cycles are overhead for each outer loop: 


- Nine cycles execute the inner loop prolog. 
- Six cycles execute the branch to the outer loop. 
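
The nested-loop cycle count above, and the share of it that is overhead, can be checked with a short C sketch. The function and parameter names are illustrative. The trailing 2 in the formula is treated here as one-time setup outside the outer loop, which is an assumption, since the text does not itemize it.

```c
#include <assert.h>

/* Total cycles for a software-pipelined inner loop nested in an
 * outer loop that is not itself pipelined. */
static int nested_loop_cycles(int outer_trips, int inner_trips, int inner_ii,
                              int prolog_cycles, int outer_branch_cycles,
                              int setup_cycles)
{
    assert(outer_trips >= 0 && inner_trips >= 0 && inner_ii > 0);
    return outer_trips *
           (inner_trips * inner_ii + prolog_cycles + outer_branch_cycles) +
           setup_cycles;
}

/* Cycles spent on per-outer-iteration overhead only. */
static int outer_overhead_cycles(int outer_trips, int prolog_cycles,
                                 int outer_branch_cycles)
{
    assert(outer_trips >= 0);
    return outer_trips * (prolog_cycles + outer_branch_cycles);
}
```

nested_loop_cycles(50, 16, 2, 9, 6, 2) evaluates to 2352, and outer_overhead_cycles(50, 9, 6) to 750, so roughly a third of the total is spent outside the inner loop kernel.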


See section 5.13, Software Pipelining the Outer Loop, on page 5-131 for in- 
formation on how to reduce this overhead. 




Example 5-64. Final Assembly Code for FIR Filter With Redundant Load Elimination 


OUTLOOP : 


[B2] 


[B2] 


[B2] 


[B2] 


[B2] 


Ooo 


.L2X 


50,A2 


80,A3 
82,B6 


*A4++[2],A0 
A4,2,B5 
B4,2,B4 
B4,0,A5 
16,B2 
A2,1,A2 


*A5++[2],Al1 
*B5++[2],Bl 
AQ 
B9 


*BAt+ 
*A4++[2],A0 


N 
Ww 
© 


*A5++[2],A1 
*BOt+ [2], BL 


’ 


’ 


, 


, 


, 


set up outer loop 


used to rst x ptr 
used to rst h ptr 


x0 = x[Jj] 

set up pointer to 
set up pointer to 
set up pointer to 
set up inner loop 


counter 


outer loop 
outer loop 


x[jt+1] 
h[1] 
h[0] 
counter 


decrement outer loop counter 


ho = h[i] 

xl = x[jtitl 
zero out sum0 
zero out suml 


hl = h[it+l] 
x0 = x[Jjt+it+2 
* ho = h[i 


* xl = x[jtit+l] 


© 


@ 


® 
® 


decrement inner loop counter 6) 


;* hl = h[itl] 


* x0 = x[Jj+it+2] 


branch to inner loop 
7** hO = hf[il 


;** xl = x[Jjtitl] 


x0 * ho 


7** hl = h[itl] 
7** xO = x[J+it2] 


, 


, 


, 


xl * hil 
xl * ho 


;* decrement inner loop counter 


* branch to inner loop 


eee HO = Hila 
**e* x1 = x[Jjt+itl] 


x0 * hil 
* 30 * hQ 


;e** hl = h[itl] 


;*** xO = x[Jj+it2] 


© 
© 


® 


;** decrement inner loop counter 
Example 5-64. Final Assembly Code for FIR Filter With Redundant Load Elimination 
(Continued) 


LOOP:
         ADD   .L2X  A8,B9,B9        ; sum1 += x1 * h0
||       ADD   .L1   A7,A9,A9        ; sum0 += x0 * h0
||       MPY   .M2   B1,B0,B7        ;* x1 * h1
||       MPY   .M1X  B1,A1,A8        ;* x1 * h0
|| [B2]  B     .S2   LOOP            ;** branch to inner loop
||       LDH   .D1   *A5++[2],A1     ;**** h0 = h[i]
||       LDH   .D2   *B5++[2],B1     ;**** x1 = x[j+i+1]

         ADD   .L1X  B7,A9,A9        ; sum0 += x1 * h1
||       ADD   .L2   B8,B9,B9        ; sum1 += x0 * h1
||       MPY   .M2X  A0,B0,B8        ;* x0 * h1
||       MPY   .M1   A0,A1,A7        ;** x0 * h0
|| [B2]  SUB   .S2   B2,1,B2         ;*** decrement inner loop cntr
||       LDH   .D2   *B4++[2],B0     ;**** h1 = h[i+1]
||       LDH   .D1   *A4++[2],A0     ;**** x0 = x[j+i+2]
         ; inner loop branch occurs here

[A2]     B     .S1   OUTLOOP         ; branch to outer loop
||       SUB   .L1   A4,A3,A4        ; reset x pointer to x[j]
||       SUB   .L2   B4,B6,B4        ; reset h pointer to h[0]

         SHR   .S1   A9,15,A9        ; sum0 >> 15
||       SHR   .S2   B9,15,B9        ; sum1 >> 15

         STH   .D1   A9,*A6++        ; y[j] = sum0 >> 15
         STH   .D1   B9,*A6++        ; y[j+1] = sum1 >> 15
         NOP         2               ; branch delay slots
         ; outer loop branch occurs here


5.12 Memory Banks 


The internal memory of the C6000 family varies from device to device. See 
the TMS320C6000 Peripherals Reference Guide to determine the memory 
blocks in your particular device. This section discusses how to write code to 
avoid memory bank conflicts. 


Most ’C6x devices use an interleaved memory bank scheme, as shown in 
Figure 5-22. Each number in the boxes represents a byte address. A load byte 
(LDB) instruction from address 0 loads byte 0 in bank 0. A load halfword (LDH) 
from address 0 loads the halfword value in bytes 0 and 1, which are also in 
bank 0. An LDW from address 0 loads bytes 0 through 3 in banks 0 and 1. 


Because each bank is single-ported memory, only one access to each bank 
is allowed per cycle. Two accesses to a single bank in a given cycle result in 
a memory stall that halts all pipeline operation for one cycle, while the second 
value is read from memory. Two memory operations per cycle are allowed 
without any stall, as long as they do not access the same bank. 
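
The bank mapping just described (four banks, each 16 bits wide, interleaved by byte address) can be modeled in a few lines of C. This is a sketch for reasoning about conflicts on a host machine. The names are hypothetical, accesses are assumed to be naturally aligned, and both accesses are assumed to fall in the same memory block.

```c
#include <assert.h>

enum { NUM_BANKS = 4, BANK_BYTES = 2 };

/* Bank that holds a given byte address. */
static unsigned bank_of(unsigned byte_addr)
{
    return (byte_addr / BANK_BYTES) % NUM_BANKS;
}

/* Bit mask of the banks touched by an aligned access of `size`
 * bytes (1 for LDB, 2 for LDH, 4 for LDW). */
static unsigned banks_touched(unsigned byte_addr, unsigned size)
{
    unsigned b;
    unsigned mask = 0;

    assert(size == 1 || size == 2 || size == 4);
    assert(byte_addr % size == 0);
    for (b = 0; b < size; b++)
        mask |= 1u << bank_of(byte_addr + b);
    return mask;
}

/* Nonzero when two accesses issued on the same cycle share a bank
 * and would therefore stall the pipeline for one cycle. */
static int bank_conflict(unsigned addr0, unsigned size0,
                         unsigned addr1, unsigned size1)
{
    return (banks_touched(addr0, size0) & banks_touched(addr1, size1)) != 0;
}
```

For instance, banks_touched(0, 4) reports banks 0 and 1, matching the LDW case in the text, and two halfword loads from byte addresses 0 and 8 conflict because both map to bank 0.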


Figure 5-22. 4-Bank Interleaved Memory 

   8N    8N+1  |  8N+2  8N+3  |  8N+4  8N+5  |  8N+6  8N+7
      Bank 0   |     Bank 1   |     Bank 2   |     Bank 3


For devices that have more than one memory block (see Figure 5-23), an 
access to bank 0 in one block does not interfere with an access to bank 0 in 
another memory block, and no pipeline stall occurs. 


Figure 5-23. 4-Bank Interleaved Memory With Two Memory Blocks 

Memory block 0:
    0     1    |   2     3    |   4     5    |   6     7
    8     9    |  10    11    |  12    13    |  14    15
      Bank 0   |     Bank 1   |     Bank 2   |     Bank 3

Memory block 1:
   8M    8M+1  |  8M+2  8M+3  |  8M+4  8M+5  |  8M+6  8M+7
      Bank 0   |     Bank 1   |     Bank 2   |     Bank 3


If each array in a loop resides in a separate memory block, the 2-cycle loop 
in Example 5-61 on page 5-111 is sufficient. This section describes a solution 
when two arrays must reside in the same memory block. 

5.12.1 FIR Filter Inner Loop 


Example 5-65 shows the inner loop from the final assembly in Example 5-64. 
The LDHs from the h array are in parallel with LDHs from the x array. If x[1] is 
on an even halfword (bank 0) and h[0] is on an odd halfword (bank 1), 
Example 5-65 has no memory conflicts. However, if both x[1] and h[0] are on 
an even halfword in memory (bank 0) and they are in the same memory block, 
every cycle incurs a memory pipeline stall and the loop runs at half the speed. 


Example 5-65. Final Assembly Code for Inner Loop of FIR Filter 


LOOP:
         ADD   .L2X  A8,B9,B9        ; sum1 += x1 * h0
||       ADD   .L1   A7,A9,A9        ; sum0 += x0 * h0
||       MPY   .M2   B1,B0,B7        ;* x1 * h1
||       MPY   .M1X  B1,A1,A8        ;* x1 * h0
|| [B2]  B     .S2   LOOP            ;** branch to inner loop
||       LDH   .D1   *A5++[2],A1     ;**** h0 = h[i]
||       LDH   .D2   *B5++[2],B1     ;**** x1 = x[j+i+1]

         ADD   .L1X  B7,A9,A9        ; sum0 += x1 * h1
||       ADD   .L2   B8,B9,B9        ; sum1 += x0 * h1
||       MPY   .M2X  A0,B0,B8        ;* x0 * h1
||       MPY   .M1   A0,A1,A7        ;** x0 * h0
|| [B2]  SUB   .S2   B2,1,B2         ;*** decrement inner loop cntr
||       LDH   .D2   *B4++[2],B0     ;**** h1 = h[i+1]
||       LDH   .D1   *A4++[2],A0     ;**** x0 = x[j+i+2]

It is not always possible to fully control how arrays are aligned, especially if one 
of the arrays is passed into a function as a pointer and that pointer has different 
alignments each time the function is called. One solution to this problem is to 
write an FIR filter that avoids memory hits, regardless of the x and h array 
alignments. 


If accesses to the even and odd elements of an array (h or x) are scheduled 
on the same cycle, the accesses are always on adjacent memory banks. Thus, 
to write an FIR filter that never has memory hits, even and odd elements of the 
same array must be scheduled on the same loop cycle. 
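
The claim that neighbouring halfword elements always land in adjacent banks holds for any array alignment, and a brute-force check in C makes that concrete. The sketch assumes the 4-bank, 16-bit-wide interleave described earlier and a halfword-aligned array; the function names are illustrative.

```c
#include <assert.h>

/* Bank of element `index` in an array of 16-bit values that starts
 * at byte address `base` (4 banks, 2 bytes per bank entry). */
static unsigned halfword_bank(unsigned base, unsigned index)
{
    assert(base % 2 == 0);
    return (base / 2 + index) % 4;
}

/* Returns 1 if, for every halfword-aligned base address tried, each
 * pair of neighbouring elements falls in two different banks. */
static int neighbours_never_collide(unsigned n_elems)
{
    unsigned base, k;

    for (base = 0; base < 16; base += 2) {
        for (k = 0; k + 1 < n_elems; k++) {
            if (halfword_bank(base, k) == halfword_bank(base, k + 1))
                return 0;
        }
    }
    return 1;
}
```

By contrast, elements of two different arrays can share a bank whenever their base addresses differ by a multiple of 8 bytes, which is the situation the inner loop above runs into when x[1] and h[0] are both on even halfwords.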
In the case of the FIR filter, scheduling the even and odd elements of the same 
array on the same loop cycle cannot be done in a 2-cycle loop, as shown in 
Figure 5-24. In this example, a valid 2-cycle software-pipelined loop without 
memory hits is governed by the following constraints: 


- LDH h0 and LDH h1 are on the same loop cycle. 
- LDH x0 and LDH x1 are on the same loop cycle. 
- MPY p00 must be scheduled three or four cycles after LDH x0, because 
  it must read x0 from the previous iteration of LDH x0. 
- All MPYs must be five or six cycles after their LDH parents. 
- No MPYs on the same side (A or B) can be on the same loop cycle. 


Figure 5-24. Dependency Graph of FIR Filter (With Even and Odd Elements of 
Each Array on Same Loop Cycle) 

[Figure not reproduced. The graph is split into an A side and a B side.]

Note: Numbers in bold represent the cycle the instruction is scheduled on. 


The scenario in Figure 5-24 almost works. All nodes satisfy the above 
constraints except MPY p10. Because one parent is on cycle 1 (LDH h0) and 
another on cycle 0 (LDH x1), the only cycle for MPY p10 is cycle 6. However, 
another MPY on the A side is also scheduled on cycle 6 (MPY p00). Other 
combinations of cycles for this graph produce similar results. 

5.12.2 Unrolled FIR Filter C Code 


The main limitation in solving the problem in Figure 5—24 is in scheduling a 2- 
cycle loop, which means that no value can be live more than two cycles. In- 
creasing the iteration interval to 3 decreases performance. A better solution 
is to unroll the inner loop one more time and produce a 4-cycle loop. 


Example 5-66 shows the FIR filter C code after unrolling the inner loop one 
more time. This solution adds to the flexibility of scheduling and allows you to 
write FIR filter code that never has memory hits, regardless of array alignment 
and memory block. 


Example 5-66. FIR Filter C Code (Unrolled) 


void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0, x1, x2, x3, h0, h1, h2, h3;

    for (j = 0; j < 100; j += 2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i += 4) {
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x2 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x2 * h1;
            x3 = x[j+i+3];
            h2 = h[i+2];
            sum0 += x2 * h2;
            sum1 += x3 * h2;
            x0 = x[j+i+4];
            h3 = h[i+3];
            sum0 += x3 * h3;
            sum1 += x0 * h3;
        }
        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}

5.12.3 Translating C Code to Linear Assembly 


Example 5-67 shows the linear assembly for the unrolled inner loop of the FIR
filter C code.


Example 5-67. Linear Assembly for Unrolled FIR Inner Loop 


        LDH     *x++,x1             ; x1 = x[j+i+1]
        LDH     *h++,h0             ; h0 = h[i]
        MPY     x0,h0,p00           ; x0 * h0
        MPY     x1,h0,p10           ; x1 * h0
        ADD     p00,sum0,sum0       ; sum0 += x0 * h0
        ADD     p10,sum1,sum1       ; sum1 += x1 * h0

        LDH     *x++,x2             ; x2 = x[j+i+2]
        LDH     *h++,h1             ; h1 = h[i+1]
        MPY     x1,h1,p01           ; x1 * h1
        MPY     x2,h1,p11           ; x2 * h1
        ADD     p01,sum0,sum0       ; sum0 += x1 * h1
        ADD     p11,sum1,sum1       ; sum1 += x2 * h1

        LDH     *x++,x3             ; x3 = x[j+i+3]
        LDH     *h++,h2             ; h2 = h[i+2]
        MPY     x2,h2,p02           ; x2 * h2
        MPY     x3,h2,p12           ; x3 * h2
        ADD     p02,sum0,sum0       ; sum0 += x2 * h2
        ADD     p12,sum1,sum1       ; sum1 += x3 * h2

        LDH     *x++,x0             ; x0 = x[j+i+4]
        LDH     *h++,h3             ; h3 = h[i+3]
        MPY     x3,h3,p03           ; x3 * h3
        MPY     x0,h3,p13           ; x0 * h3
        ADD     p03,sum0,sum0       ; sum0 += x3 * h3
        ADD     p13,sum1,sum1       ; sum1 += x0 * h3

 [cntr] SUB     cntr,1,cntr         ; decrement loop counter
 [cntr] B       LOOP                ; branch to loop


5.12.4 Drawing a Dependency Graph


Figure 5-25 shows the dependency graph of the FIR filter with no memory 
hits. 


Figure 5-25. Dependency Graph of FIR Filter (With No Memory Hits)

[Figure: dependency graph split into A side and B side, with the LDH nodes at the top]


5.12.5 Linear Assembly for Unrolled FIR Inner Loop With .mptr Directive


Example 5-68 shows the unrolled FIR inner loop with the .mptr directive. The
.mptr directive allows the assembly optimizer to automatically determine if two
memory operations have a bank conflict by associating memory access
information with a specific pointer register.


If the assembly optimizer determines that two memory operations have a bank
conflict, then it will not schedule them in parallel. The .mptr directive tells the
assembly optimizer that when the specified register is used as a memory
pointer in a load or store instruction, it is initialized to point at a base location
+ <offset>, and is incremented a number of times each time through the loop.


Without the .mptr directives, the loads of x1 and h0 are scheduled in parallel,
and the loads of x2 and h1 are scheduled in parallel. This results in a 50%
chance of a memory conflict on every cycle.
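
The effect of pairing a pointer with its one-halfword-offset twin can be checked with a small model. The following C sketch is an illustration only: the bank layout (four interleaved banks, each one halfword wide) and the helper names bank_of, loads_conflict, and count_conflicts are assumptions made for this example and are not taken from the text. It models two pointers declared like `.mptr x, x+0` and `.mptr x_1, x+2`, each advancing four bytes per access.

```c
#include <stdint.h>

/* Assumed bank layout (hypothetical, for illustration): four interleaved
   banks, each one halfword (2 bytes) wide. */
enum { NUM_BANKS = 4, BANK_WIDTH_BYTES = 2 };

/* Bank that holds the halfword at a given byte address. */
static unsigned bank_of(uint32_t byte_addr)
{
    return (byte_addr / BANK_WIDTH_BYTES) % NUM_BANKS;
}

/* Two halfword loads issued on the same cycle conflict when they
   address the same bank. */
static int loads_conflict(uint32_t addr_a, uint32_t addr_b)
{
    return bank_of(addr_a) == bank_of(addr_b);
}

/* Count conflicts between a pointer that starts at base+0 and one that
   starts at base+2, both stepping 4 bytes per access. */
static int count_conflicts(uint32_t base, int accesses)
{
    int k, conflicts = 0;

    for (k = 0; k < accesses; k++)
        conflicts += loads_conflict(base + 4u * (uint32_t)k,
                                    base + 2u + 4u * (uint32_t)k);
    return conflicts;
}
```

Under this assumed layout, two loads from the same array at adjacent halfwords always land in different banks, whatever the base address, whereas two loads from unrelated arrays (such as x and h) may or may not collide depending on their relative placement.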


Example 5-68. Linear Assembly for Full Unrolled FIR Filter


          .global _fir

_fir:     .cproc  x, h, y

          .reg    x_1, h_1, sum0, sum1, ctr, octr
          .reg    p00, p01, p02, p03, p10, p11, p12, p13
          .reg    x0, x1, x2, x3, h0, h1, h2, h3, rstx, rsth

          ADD     h,2,h_1               ; set up pointer to h[1]
          MVK     50,octr               ; outer loop ctr = 100/2
          MVK     64,rstx               ; used to rst x pointer each outer loop
          MVK     64,rsth               ; used to rst h pointer each outer loop

OUTLOOP:
          ADD     x,2,x_1               ; set up pointer to x[j+1]
          SUB     h_1,2,h               ; set up pointer to h[0]
          MVK     8,ctr                 ; inner loop ctr = 32/4
          ZERO    sum0                  ; sum0 = 0
          ZERO    sum1                  ; sum1 = 0
   [octr] SUB     octr,1,octr           ; decrement outer loop counter

          .mptr   x, x+0
          .mptr   x_1, x+2
          .mptr   h, h+0
          .mptr   h_1, h+2

          LDH             *x++[2],x0    ; x0 = x[j]

LOOP:
          LDH     .D1     *x_1++[2],x1  ; x1 = x[j+i+1]
          LDH     .D1     *h++[2],h0    ; h0 = h[i]
          MPY     .M1X    x0,h0,p00     ; x0 * h0
          MPY     .M1     x1,h0,p10     ; x1 * h0
          ADD     .L1     p00,sum0,sum0 ; sum0 += x0 * h0
          ADD     .L2X    p10,sum1,sum1 ; sum1 += x1 * h0

          LDH     .D2     *x++[2],x2    ; x2 = x[j+i+2]
          LDH     .D2     *h_1++[2],h1  ; h1 = h[i+1]
          MPY     .M2X    x1,h1,p01     ; x1 * h1
          MPY     .M2     x2,h1,p11     ; x2 * h1
          ADD     .L1X    p01,sum0,sum0 ; sum0 += x1 * h1
          ADD     .L2     p11,sum1,sum1 ; sum1 += x2 * h1

          LDH     .D1     *x_1++[2],x3  ; x3 = x[j+i+3]
          LDH     .D1     *h++[2],h2    ; h2 = h[i+2]
          MPY     .M1X    x2,h2,p02     ; x2 * h2
          MPY     .M1     x3,h2,p12     ; x3 * h2
          ADD     .L1     p02,sum0,sum0 ; sum0 += x2 * h2
          ADD     .L2X    p12,sum1,sum1 ; sum1 += x3 * h2

          LDH     .D2     *x++[2],x0    ; x0 = x[j+i+4]
          LDH     .D2     *h_1++[2],h3  ; h3 = h[i+3]
          MPY     .M2X    x3,h3,p03     ; x3 * h3
          MPY     .M2     x0,h3,p13     ; x0 * h3
          ADD     .L1X    p03,sum0,sum0 ; sum0 += x3 * h3
          ADD     .L2     p13,sum1,sum1 ; sum1 += x0 * h3

   [ctr]  SUB     .S2     ctr,1,ctr     ; decrement loop counter
   [ctr]  B       .S2     LOOP          ; branch to loop

          SHR     sum0,15,sum0          ; sum0 >> 15
          SHR     sum1,15,sum1          ; sum1 >> 15
          STH     sum0,*y++             ; y[j] = sum0 >> 15
          STH     sum1,*y++             ; y[j+1] = sum1 >> 15
          SUB     x,rstx,x              ; reset x pointer to x[j]
          SUB     h_1,rsth,h_1          ; reset h pointer to h[0]
   [octr] B       OUTLOOP               ; branch to outer loop


5.12.6 Linear Assembly Resource Allocation


As the number of instructions in a loop increases, assigning a specific register
to every value in the loop becomes increasingly difficult. If 33 instructions in
a loop each write a value, they cannot each write to a unique register because
the ’C62x and ’C67x have only 32 registers. (The ’C64x has 64 registers, so this
particular case would fit there, but the same limit applies to larger loops.)
As a result, values that are not live on the same cycles in the loop must share
registers.


For example, in a 4-cycle loop: 


❏ If a value is written at the end of cycle 0 and read on cycle 2 of the loop,
it is live for two cycles (cycles 1 and 2 of the loop).


❏ If another value is written at the end of cycle 2 and read on cycle 0 (the next
iteration) of the loop, it is also live for two cycles (cycles 3 and 0 of the loop).


Because both of these values are not live on the same cycles, they can occupy 
the same register. Only after scheduling these instructions and their children 
do you know that they can occupy the same register. 
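
The lifetime test described above can be written down directly. The following C sketch is an illustration under stated assumptions: the function names live_mask and can_share_register are hypothetical, cycles are counted without wrapping (so a read on cycle 0 of the next iteration of a 4-cycle loop is passed as cycle 4), and the iteration interval is assumed to be small enough to fit in an unsigned bit mask.

```c
/* Return a bit mask of the loop cycles (modulo ii) on which a value is
   live. The value is written at the end of write_cycle and last read on
   last_read_cycle, where last_read_cycle is counted without wrapping
   and is greater than write_cycle. Bit c set means live on cycle c. */
static unsigned live_mask(int write_cycle, int last_read_cycle, int ii)
{
    unsigned mask = 0;
    int c;

    for (c = write_cycle + 1; c <= last_read_cycle; c++)
        mask |= 1u << (c % ii);
    return mask;
}

/* Two values can occupy the same register when they are never live on
   the same loop cycle. */
static int can_share_register(unsigned mask_a, unsigned mask_b)
{
    return (mask_a & mask_b) == 0;
}
```

For the two values in the example, live_mask(0, 2, 4) covers cycles 1 and 2 and live_mask(2, 4, 4) covers cycles 3 and 0, so can_share_register reports that they may share a register.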


Register allocation is not complicated but can be tedious when done by hand. 
Each value has to be analyzed for its lifetime and then appropriately combined 
with other values not live on the same cycles in the loop. The assembly opti- 
mizer handles this automatically after it software pipelines the loop. See the 
TMS320C6000 Optimizing C/C++ Compiler User’s Guide for more informa- 
tion. 
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5.12.7 Determining the Minimum Iteration Interval 


Based on Table 5—24, the minimum iteration interval for the FIR filter with no 
memory hits should be 4. An iteration interval of 4 means that two multiply/ac- 
cumulates still execute per cycle. 


Table 5-24. Resource Table for FIR Filter Code


(a) A side

Unit(s)               Instructions       Total/Unit
.M1                   4 MPYs             4
.S1                                      0
.D1                   4 LDHs             4
.L1, .S1, or .D1      4 ADDs             4
Total non-.M units                       8
1X paths                                 4

(b) B side

Unit(s)               Instructions       Total/Unit
.M2                   4 MPYs             4
.S2                   B                  1
.D2                   4 LDHs             4
.L2, .S2, or .D2      4 ADDs and SUB     5
Total non-.M units                       10
2X paths                                 4
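
The way a resource table bounds the iteration interval can be sketched in C. This is a simplified illustration, not the assembly optimizer's algorithm: it considers only functional-unit and cross-path counts and ignores loop-carried dependencies. The struct side_usage and the function names are hypothetical. Each side has one .M unit, three non-.M units (.L, .S, .D), and one cross path, so each count is divided by the number of units that can serve it.

```c
/* Per-side resource counts, as read from a resource table. */
struct side_usage {
    int m_unit;      /* instructions that must use the .M unit      */
    int s_unit;      /* instructions that must use the .S unit      */
    int d_unit;      /* instructions that must use the .D unit      */
    int non_m_total; /* all instructions on .L, .S, or .D           */
    int x_path;      /* uses of this side's cross path              */
};

/* Smallest integer greater than or equal to a / b, for positive b. */
static int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

static int max2(int a, int b)
{
    return a > b ? a : b;
}

/* Resource-limited lower bound on the iteration interval for one side. */
static int side_min_ii(struct side_usage u)
{
    int ii = u.m_unit;

    ii = max2(ii, u.s_unit);
    ii = max2(ii, u.d_unit);
    ii = max2(ii, ceil_div(u.non_m_total, 3)); /* three non-.M units */
    ii = max2(ii, u.x_path);
    return ii;
}

/* The loop cannot run faster than the more heavily loaded side. */
static int min_iteration_interval(struct side_usage a, struct side_usage b)
{
    return max2(side_min_ii(a), side_min_ii(b));
}
```

With the A-side counts (4, 0, 4, 8, 4) and the B-side counts (4, 1, 4, 10, 4) from Table 5-24, both sides give a bound of 4, which agrees with the minimum iteration interval stated for this loop.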


5.12.8 Final Assembly 


Example 5-69 shows the final assembly for the FIR filter with redundant load
elimination and no memory hits. At the end of the inner loop, there is a branch
to OUTLOOP to execute the next outer loop. The outer loop counter is set to
50 because iterations j and j+1 are executing each time the inner loop is run.
The inner loop counter is set to 8 because iterations i, i+1, i+2, and i+3 are
executing each inner loop iteration.


5.12.9 Comparing Performance 


The cycle count for this nested loop is 2402 cycles. There is a rather large 
outer-loop overhead for executing the branch to the outer loop (6 cycles) and 
the inner loop prolog (10 cycles). Section 5.13 addresses how to reduce this 
overhead by software pipelining the outer loop. 
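
The cycle counts in this comparison all follow one pattern, which the following C sketch makes explicit. The function and parameter names are hypothetical labels chosen for this illustration; in particular, the text does not name the final constant, so it is called tail here.

```c
/* Cycle count of a nested loop whose inner loop is software pipelined:
   every outer iteration pays for the inner-loop kernel, the inner-loop
   prolog, and the outer-loop overhead; a fixed number of cycles is
   added once at the end. */
static int nested_loop_cycles(int outer_iters, int inner_iters,
                              int iteration_interval, int prolog_cycles,
                              int outer_overhead, int tail)
{
    return outer_iters *
           (inner_iters * iteration_interval + prolog_cycles + outer_overhead)
           + tail;
}
```

For this section's code, nested_loop_cycles(50, 8, 4, 10, 6, 2) gives 2402, and the earlier 2-cycle version corresponds to nested_loop_cycles(50, 16, 2, 9, 6, 2), which gives 2352. The comparison shows that the extra prolog cycle per outer iteration accounts for the whole difference.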


Table 5-25. Comparison of FIR Filter Code


Code Example                                                            Cycles                      Cycle Count
Example 5-64  FIR with redundant load elimination                       50 (16 x 2 + 9 + 6) + 2     2352
Example 5-69  FIR with redundant load elimination and no memory hits    50 (8 x 4 + 10 + 6) + 2     2402


Example 5-69. Final Assembly Code for FIR Filter With Redundant Load Elimination


and No Memory Hits


        MVK     .S1     50,A2           ; set up outer loop counter

        MVK     .S1     62,A3           ; used to rst x pointer outloop
||      MVK     .S2     64,B10          ; used to rst h pointer outloop

OUTLOOP:
        LDH     .D1     *A4++,B5        ; x0 = x[j]
||      ADD     .L2X    A4,4,B1         ; set up pointer to x[j+2]
||      ADD     .L1X    B4,2,A8         ; set up pointer to h[1]
||      MVK     .S2     8,B2            ; set up inner loop counter
|| [A2] SUB     .S1     A2,1,A2         ; decrement outer loop counter

        LDH     .D2     *B1++[2],B0     ; x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ; x1 = x[j+i+1]
||      ZERO    .L1     A9              ; zero out sum0
||      ZERO    .L2     B9              ; zero out sum1

        LDH     .D1     *A8++[2],B6     ; h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ; h0 = h[i]

        LDH     .D1     *A4++[2],A5     ; x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ; x0 = x[j+i+4]

        LDH     .D2     *B4++[2],A7     ; h2 = h[i+2]
||      LDH     .D1     *A8++[2],B8     ; h3 = h[i+3]
|| [B2] SUB     .S2     B2,1,B2         ; decrement loop counter

        LDH     .D2     *B1++[2],B0     ;* x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;* x1 = x[j+i+1]

        LDH     .D1     *A8++[2],B6     ;* h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ;* h0 = h[i]

        MPY     .M1X    B5,A1,A0        ; x0 * h0
||      MPY     .M2X    A0,B6,B6        ; x1 * h1
||      LDH     .D1     *A4++[2],A5     ;* x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ;* x0 = x[j+i+4]

   [B2] B       .S1     LOOP            ; branch to loop
||      MPY     .M2     B0,B6,B7        ; x2 * h1
||      MPY     .M1     A0,A1,A1        ; x1 * h0
||      LDH     .D2     *B4++[2],A7     ;* h2 = h[i+2]
||      LDH     .D1     *A8++[2],B8     ;* h3 = h[i+3]
|| [B2] SUB     .S2     B2,1,B2         ;* decrement loop counter

        ADD     .L1     A0,A9,A9        ; sum0 += x0 * h0
||      MPY     .M2X    A5,B8,B8        ; x3 * h3
||      MPY     .M1X    B0,A7,A5        ; x2 * h2
||      LDH     .D2     *B1++[2],B0     ;** x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;** x1 = x[j+i+1]

LOOP:
        ADD     .L2X    A1,B9,B9        ; sum1 += x1 * h0
||      ADD     .L1X    B6,A9,A9        ; sum0 += x1 * h1
||      MPY     .M2     B5,B8,B7        ; x0 * h3
||      MPY     .M1     A5,A7,A7        ; x3 * h2
|| [B2] LDH     .D1     *A8++[2],B6     ;** h1 = h[i+1]
|| [B2] LDH     .D2     *B4++[2],A1     ;** h0 = h[i]

        ADD     .L2     B7,B9,B9        ; sum1 += x2 * h1
||      ADD     .L1     A5,A9,A9        ; sum0 += x2 * h2
||      MPY     .M1X    B5,A1,A0        ;* x0 * h0
||      MPY     .M2X    A0,B6,B6        ;* x1 * h1
|| [B2] LDH     .D1     *A4++[2],A5     ;** x3 = x[j+i+3]
|| [B2] LDH     .D2     *B1++[2],B5     ;** x0 = x[j+i+4]

        ADD     .L2X    A7,B9,B9        ; sum1 += x3 * h2
||      ADD     .L1X    B8,A9,A9        ; sum0 += x3 * h3
|| [B2] B       .S1     LOOP            ;* branch to loop
||      MPY     .M2     B0,B6,B7        ;* x2 * h1
||      MPY     .M1     A0,A1,A1        ;* x1 * h0
|| [B2] LDH     .D2     *B4++[2],A7     ;** h2 = h[i+2]
|| [B2] LDH     .D1     *A8++[2],B8     ;** h3 = h[i+3]
|| [B2] SUB     .S2     B2,1,B2         ;** decrement loop counter

        ADD     .L2     B7,B9,B9        ; sum1 += x0 * h3
||      ADD     .L1     A0,A9,A9        ;* sum0 += x0 * h0
||      MPY     .M2X    A5,B8,B8        ;* x3 * h3
||      MPY     .M1X    B0,A7,A5        ;* x2 * h2
|| [B2] LDH     .D2     *B1++[2],B0     ;*** x2 = x[j+i+2]
|| [B2] LDH     .D1     *A4++[2],A0     ;*** x1 = x[j+i+1]
        ; inner loop branch occurs here

   [A2] B       .S2     OUTLOOP         ; branch to outer loop
||      SUB     .L1     A4,A3,A4        ; reset x pointer to x[j]
||      SUB     .L2     B4,B10,B4       ; reset h pointer to h[0]
||      SUB     .S1     A9,A0,A9        ; sum0 -= x0*h0 (eliminate add)

        SHR     .S1     A9,15,A9        ; sum0 >> 15
||      SHR     .S2     B9,15,B9        ; sum1 >> 15

        STH     .D1     A9,*A6++        ; y[j] = sum0 >> 15

        STH     .D1     B9,*A6++        ; y[j+1] = sum1 >> 15

        NOP     2                       ; branch delay slots
        ; outer loop branch occurs here


5.13 Software Pipelining the Outer Loop


In previous examples, software pipelining has always affected the inner loop. 
However, software pipelining works equally well with the outer loop in a nested 
loop. 


5.13.1 Unrolled FIR Filter C Code 


Example 5-70 shows the FIR filter C code after unrolling the inner loop
(identical to Example 5-66 on page 5-122).


Example 5—70. Unrolled FIR Filter C Code 


void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0,x1,x2,x3,h0,h1,h2,h3;

    for (j = 0; j < 100; j+=2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i+=4) {
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x2 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x2 * h1;
            x3 = x[j+i+3];
            h2 = h[i+2];
            sum0 += x2 * h2;
            sum1 += x3 * h2;
            x0 = x[j+i+4];
            h3 = h[i+3];
            sum0 += x3 * h3;
            sum1 += x0 * h3;
        }
        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}


5.13.2 Making the Outer Loop Parallel With the Inner Loop Epilog and Prolog


The final assembly code for the FIR filter with redundant load elimination and 
no memory hits (shown in Example 5-69 on page 5-129) contained 16 cycles 
of overhead to call the inner loop every time: ten cycles for the loop prolog and 
six cycles for the outer loop instructions and branching to the outer loop. 


Most of this overhead can be reduced as follows: 


❏ Put the outer loop and branch instructions in parallel with the prolog.
❏ Create an epilog to the inner loop.
❏ Put some outer loop instructions in parallel with the inner-loop epilog.


5.13.3 Final Assembly


Example 5-71 shows the final assembly for the FIR filter with a
software-pipelined outer loop. Below the inner loop (starting on page 5-134),
each instruction is marked in the comments with an e, p, or o for instructions
relating to epilog, prolog, or outer loop, respectively.


The inner loop is now only run seven times, because the eighth iteration is
done in the epilog in parallel with the prolog of the next inner loop and the outer
loop instructions.


Example 5-71. Final Assembly Code for FIR Filter With Redundant Load Elimination and
No Memory Hits With Outer Loop Software-Pipelined


        MVK     .S1     50,A2           ; set up outer loop counter

        STW     .D2     B11,*B15--      ; push register
||      MVK     .S1     74,A3           ; used to rst x ptr outer loop
||      MVK     .S2     72,B10          ; used to rst h ptr outer loop
||      ADD     .L2X    A6,2,B11        ; set up pointer to y[1]

        LDH     .D1     *A4++,B8        ; x0 = x[j]
||      ADD     .L2X    A4,4,B1         ; set up pointer to x[j+2]
||      ADD     .L1X    B4,2,A8         ; set up pointer to h[1]
||      MVK     .S2     8,B2            ; set up inner loop counter
|| [A2] SUB     .S1     A2,1,A2         ; decrement outer loop counter

        LDH     .D2     *B1++[2],B0     ; x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ; x1 = x[j+i+1]
||      ZERO    .L1     A9              ; zero out sum0
||      ZERO    .L2     B9              ; zero out sum1

        LDH     .D1     *A8++[2],B6     ; h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ; h0 = h[i]

        LDH     .D1     *A4++[2],A5     ; x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ; x0 = x[j+i+4]

OUTLOOP:
        LDH     .D2     *B4++[2],A7     ; h2 = h[i+2]
||      LDH     .D1     *A8++[2],B8     ; h3 = h[i+3]
|| [B2] SUB     .S2     B2,1,B2         ; decrement loop counter

        LDH     .D2     *B1++[2],B0     ;* x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;* x1 = x[j+i+1]

        LDH     .D1     *A8++[2],B6     ;* h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ;* h0 = h[i]

        MPY     .M1X    B8,A1,A0        ; x0 * h0
||      MPY     .M2X    A0,B6,B6        ; x1 * h1
||      LDH     .D1     *A4++[2],A5     ;* x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ;* x0 = x[j+i+4]

   [B2] B       .S1     LOOP            ; branch to loop
||      MPY     .M2     B0,B6,B7        ; x2 * h1
||      MPY     .M1     A0,A1,A1        ; x1 * h0
||      LDH     .D2     *B4++[2],A7     ;* h2 = h[i+2]
||      LDH     .D1     *A8++[2],B8     ;* h3 = h[i+3]
|| [B2] SUB     .S2     B2,1,B2         ;* decrement loop counter

        ADD     .L1     A0,A9,A9        ; sum0 += x0 * h0
||      MPY     .M2X    A5,B8,B8        ; x3 * h3
||      MPY     .M1X    B0,A7,A5        ; x2 * h2
||      LDH     .D2     *B1++[2],B0     ;** x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;** x1 = x[j+i+1]

LOOP:
        ADD     .L2X    A1,B9,B9        ; sum1 += x1 * h0
||      ADD     .L1X    B6,A9,A9        ; sum0 += x1 * h1
||      MPY     .M2     B5,B8,B7        ; x0 * h3
||      MPY     .M1     A5,A7,A7        ; x3 * h2
||      LDH     .D1     *A8++[2],B6     ;** h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ;** h0 = h[i]

        ADD     .L2     B7,B9,B9        ; sum1 += x2 * h1
||      ADD     .L1     A5,A9,A9        ; sum0 += x2 * h2
||      MPY     .M1X    B5,A1,A0        ;* x0 * h0
||      MPY     .M2X    A0,B6,B6        ;* x1 * h1
||      LDH     .D1     *A4++[2],A5     ;** x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ;** x0 = x[j+i+4]

        ADD     .L2X    A7,B9,B9        ; sum1 += x3 * h2
||      ADD     .L1X    B8,A9,A9        ; sum0 += x3 * h3
|| [B2] B       .S1     LOOP            ;* branch to loop
||      MPY     .M2     B0,B6,B7        ;* x2 * h1
||      MPY     .M1     A0,A1,A1        ;* x1 * h0
||      LDH     .D2     *B4++[2],A7     ;** h2 = h[i+2]
||      LDH     .D1     *A8++[2],B8     ;** h3 = h[i+3]
|| [B2] SUB     .S2     B2,1,B2         ;** decrement loop counter

        ADD     .L2     B7,B9,B9        ; sum1 += x0 * h3
||      ADD     .L1     A0,A9,A9        ;* sum0 += x0 * h0
||      MPY     .M2X    A5,B8,B8        ;* x3 * h3
||      MPY     .M1X    B0,A7,A5        ;* x2 * h2
||      LDH     .D2     *B1++[2],B0     ;*** x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;*** x1 = x[j+i+1]
        ; inner loop branch occurs here

        ADD     .L2X    A1,B9,B9        ;e sum1 += x1 * h0
||      ADD     .L1X    B6,A9,A9        ;e sum0 += x1 * h1
||      MPY     .M2     B5,B8,B7        ;e x0 * h3
||      MPY     .M1     A5,A7,A7        ;e x3 * h2
||      SUB     .D1     A4,A3,A4        ;o reset x pointer to x[j]
||      SUB     .D2     B4,B10,B4       ;o reset h pointer to h[0]
|| [A2] B       .S1     OUTLOOP         ;o branch to outer loop

        ADD     .D2     B7,B9,B9        ;e sum1 += x2 * h1
||      ADD     .L1     A5,A9,A9        ;e sum0 += x2 * h2
||      LDH     .D1     *A4++,B8        ;p x0 = x[j]
||      ADD     .L2X    A4,4,B1         ;o set up pointer to x[j+2]
||      ADD     .S1X    B4,2,A8         ;o set up pointer to h[1]
||      MVK     .S2     8,B2            ;o set up inner loop counter

        ADD     .L2X    A7,B9,B9        ;e sum1 += x3 * h2
||      ADD     .L1X    B8,A9,A9        ;e sum0 += x3 * h3
||      LDH     .D2     *B1++[2],B0     ;p x2 = x[j+i+2]
||      LDH     .D1     *A4++[2],A0     ;p x1 = x[j+i+1]
|| [A2] SUB     .S1     A2,1,A2         ;o decrement outer loop counter

        ADD     .L2     B7,B9,B9        ;e sum1 += x0 * h3
||      SHR     .S1     A9,15,A9        ;e sum0 >> 15
||      LDH     .D1     *A8++[2],B6     ;p h1 = h[i+1]
||      LDH     .D2     *B4++[2],A1     ;p h0 = h[i]

        SHR     .S2     B9,15,B9        ;e sum1 >> 15
||      LDH     .D1     *A4++[2],A5     ;p x3 = x[j+i+3]
||      LDH     .D2     *B1++[2],B5     ;p x0 = x[j+i+4]

        STH     .D1     A9,*A6++[2]     ;e y[j] = sum0 >> 15
||      STH     .D2     B9,*B11++[2]    ;e y[j+1] = sum1 >> 15
||      ZERO    .S1     A9              ;o zero out sum0
||      ZERO    .S2     B9              ;o zero out sum1
        ; outer loop branch occurs here


5.13.4 Comparing Performance


The improved cycle count for this loop is 2006 cycles: 50 ((7 x 4) + 6 + 6) + 6. The
outer-loop overhead for this loop has been reduced from 16 to 8 (6 + 6 - 4);
the -4 represents one iteration less for the inner-loop iteration (seven instead
of eight).


Table 5-26. Comparison of FIR Filter Code


Code Example                                                            Cycles                      Cycle Count
Example 5-64  FIR with redundant load elimination                       50 (16 x 2 + 9 + 6) + 2     2352
Example 5-69  FIR with redundant load elimination and no memory hits    50 (8 x 4 + 10 + 6) + 2     2402
Example 5-71  FIR with redundant load elimination and no memory hits    50 (7 x 4 + 6 + 6) + 6      2006
              with outer loop software-pipelined


5.14 Outer Loop Conditionally Executed With Inner Loop


Software pipelining the outer loop improved the outer loop overhead in the 
previous example from 16 cycles to 8 cycles. Executing the outer loop condi- 
tionally and in parallel with the inner loop eliminates the overhead entirely. 


5.14.1 Unrolled FIR Filter C Code 


Example 5-72 shows the same unrolled FIR filter C code that was used in the
previous example.


Example 5-72. Unrolled FIR Filter C Code 


void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0,x1,x2,x3,h0,h1,h2,h3;

    for (j = 0; j < 100; j+=2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i+=4) {
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x2 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x2 * h1;
            x3 = x[j+i+3];
            h2 = h[i+2];
            sum0 += x2 * h2;
            sum1 += x3 * h2;
            x0 = x[j+i+4];
            h3 = h[i+3];
            sum0 += x3 * h3;
            sum1 += x0 * h3;
        }
        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}


5.14.2 Translating C Code to Linear Assembly (Inner Loop)


Example 5-73 shows a list of linear assembly for the inner loop of the FIR filter
C code (identical to Example 5-67 on page 5-123).


Example 5—73. Linear Assembly for Unrolled FIR Inner Loop 


        LDH     *x++,x1             ; x1 = x[j+i+1]
        LDH     *h++,h0             ; h0 = h[i]
        MPY     x0,h0,p00           ; x0 * h0
        MPY     x1,h0,p10           ; x1 * h0
        ADD     p00,sum0,sum0       ; sum0 += x0 * h0
        ADD     p10,sum1,sum1       ; sum1 += x1 * h0

        LDH     *x++,x2             ; x2 = x[j+i+2]
        LDH     *h++,h1             ; h1 = h[i+1]
        MPY     x1,h1,p01           ; x1 * h1
        MPY     x2,h1,p11           ; x2 * h1
        ADD     p01,sum0,sum0       ; sum0 += x1 * h1
        ADD     p11,sum1,sum1       ; sum1 += x2 * h1

        LDH     *x++,x3             ; x3 = x[j+i+3]
        LDH     *h++,h2             ; h2 = h[i+2]
        MPY     x2,h2,p02           ; x2 * h2
        MPY     x3,h2,p12           ; x3 * h2
        ADD     p02,sum0,sum0       ; sum0 += x2 * h2
        ADD     p12,sum1,sum1       ; sum1 += x3 * h2

        LDH     *x++,x0             ; x0 = x[j+i+4]
        LDH     *h++,h3             ; h3 = h[i+3]
        MPY     x3,h3,p03           ; x3 * h3
        MPY     x0,h3,p13           ; x0 * h3
        ADD     p03,sum0,sum0       ; sum0 += x3 * h3
        ADD     p13,sum1,sum1       ; sum1 += x0 * h3

 [cntr] SUB     cntr,1,cntr         ; decrement loop counter
 [cntr] B       LOOP                ; branch to loop


5.14.3 Translating C Code to Linear Assembly (Outer Loop)


Example 5—74 shows the instructions that execute all of the outer loop func- 
tions. All of these instructions are conditional on inner loop counters. Two 
different counters are needed, because they must decrement to 0 on different 
iterations. 


❏ The resetting of the x and h pointers is conditional on the pointer reset
counter, pctr.


❏ The shifting and storing of the even and odd y elements are conditional on
the store counter, sctr.


When these counters are 0, all of the instructions that are conditional on that
value execute.


❏ The MVK instruction resets the counters to 8 because after every eight
iterations of the loop, a new inner loop is completed (8 x 4 elements are
processed).


❏ The pointer reset counter becomes 0 first to reset the load pointers, then
the store counter becomes 0 to shift and store the result.


Example 5—74. Linear Assembly for FIR Outer Loop 


 [sctr]  SUB     sctr,1,sctr        ; dec store lp cntr
 [!sctr] SHR     sum07,15,y0        ; (sum0 >> 15)
 [!sctr] SHR     sum17,15,y1        ; (sum1 >> 15)
 [!sctr] STH     y0,*y++[2]         ; y[j] = (sum0 >> 15)
 [!sctr] STH     y1,*y_1++[2]       ; y[j+1] = (sum1 >> 15)
 [!sctr] MVK     4,sctr             ; reset store lp cntr
 [pctr]  SUB     pctr,1,pctr        ; dec pointer reset lp cntr
 [!pctr] SUB     x,rstx2,x          ; reset x ptr
 [!pctr] SUB     x_1,rstx1,x_1      ; reset x_1 ptr
 [!pctr] SUB     h,rsth1,h          ; reset h ptr
 [!pctr] SUB     h_1,rsth2,h_1      ; reset h_1 ptr
 [!pctr] MVK     4,pctr             ; reset pointer reset lp cntr
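
The idea of folding the outer loop into the inner loop can be seen at the C level. The following sketch is an illustration only, not the code the manual generates: fir_nested is a plain reference version of the filter, and fir_flat runs a single flat loop in which the outer-loop work (store, accumulator clear, pointer reset) executes only when a down-counter reaches 0. Both function names and the single shared counter are assumptions made for this example; in the software-pipelined assembly two counters are needed because the pointer reset and the store happen on different iterations. The input array x must hold at least 131 elements.

```c
/* Reference version: nested loops, two outputs per outer iteration. */
static void fir_nested(const short x[], const short h[], short y[])
{
    int i, j;

    for (j = 0; j < 100; j += 2) {
        int sum0 = 0, sum1 = 0;

        for (i = 0; i < 32; i++) {
            sum0 += x[j + i] * h[i];
            sum1 += x[j + i + 1] * h[i];
        }
        y[j] = (short)(sum0 >> 15);
        y[j + 1] = (short)(sum1 >> 15);
    }
}

/* Flat version: one loop of 50 * 8 passes, four taps per pass. The
   outer-loop work is conditional on the counter reaching 0. */
static void fir_flat(const short x[], const short h[], short y[])
{
    const short *xp = x, *hp = h;
    int sum0 = 0, sum1 = 0;
    int ctr = 8;                 /* passes left in the current inner loop */
    int n, k;

    for (n = 0; n < 50 * 8; n++) {
        for (k = 0; k < 4; k++) {
            sum0 += xp[k] * hp[k];
            sum1 += xp[k + 1] * hp[k];
        }
        xp += 4;
        hp += 4;
        ctr--;
        if (ctr == 0) {          /* conditional outer-loop work */
            *y++ = (short)(sum0 >> 15);
            *y++ = (short)(sum1 >> 15);
            sum0 = 0;
            sum1 = 0;
            xp -= 30;            /* back 32 elements, forward 2: x[j+2] */
            hp -= 32;            /* back to h[0] */
            ctr = 8;             /* reset the counter */
        }
    }
}
```

Because the conditional block adds no separate loop structure, the flat form has no prolog or branch overhead between outer iterations, which is the property the conditional assembly instructions exploit.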


5.14.4 Unrolled FIR Filter C Code 


The total number of instructions to execute both the inner and outer loops is
38 (26 for the inner loop and 12 for the outer loop). A 4-cycle loop is no longer
possible, because eight functional units over four cycles provide only 32
instruction slots. To avoid slowing down the throughput of the inner loop to
reduce the outer-loop overhead, you must unroll the FIR filter again.


Example 5-75 shows the C code for the FIR filter, which operates on eight
elements every inner loop. Two outer loops are also being processed together,
as in Example 5-72 on page 5-136.


Example 5-75. Unrolled FIR Filter C Code

void fir(short x[], short h[], short y[])
{
    int i, j, sum0, sum1;
    short x0,x1,x2,x3,x4,x5,x6,x7,h0,h1,h2,h3,h4,h5,h6,h7;

    for (j = 0; j < 100; j+=2) {
        sum0 = 0;
        sum1 = 0;
        x0 = x[j];
        for (i = 0; i < 32; i+=8) {
            x1 = x[j+i+1];
            h0 = h[i];
            sum0 += x0 * h0;
            sum1 += x1 * h0;
            x2 = x[j+i+2];
            h1 = h[i+1];
            sum0 += x1 * h1;
            sum1 += x2 * h1;
            x3 = x[j+i+3];
            h2 = h[i+2];
            sum0 += x2 * h2;
            sum1 += x3 * h2;
            x4 = x[j+i+4];
            h3 = h[i+3];
            sum0 += x3 * h3;
            sum1 += x4 * h3;
            x5 = x[j+i+5];
            h4 = h[i+4];
            sum0 += x4 * h4;
            sum1 += x5 * h4;
            x6 = x[j+i+6];
            h5 = h[i+5];
            sum0 += x5 * h5;
            sum1 += x6 * h5;
            x7 = x[j+i+7];
            h6 = h[i+6];
            sum0 += x6 * h6;
            sum1 += x7 * h6;
            x0 = x[j+i+8];
            h7 = h[i+7];
            sum0 += x7 * h7;
            sum1 += x0 * h7;
        }
        y[j] = sum0 >> 15;
        y[j+1] = sum1 >> 15;
    }
}


5.14.5 Translating C Code to Linear Assembly (Inner Loop)


Example 5-76 shows the instructions that perform the inner and outer loops
of the FIR filter. These instructions reflect the following modifications:


❏ LDWs are used instead of LDHs to reduce the number of loads in the loop.

❏ The reset pointer instructions immediately follow the LDW instructions.

❏ The first ADD instructions for sum0 and sum1 are conditional on the same
value as the store counter, because when sctr is 0, the end of one inner
loop has been reached and the first ADD, which adds the previous sum07
to p00, must not be executed.

❏ The first ADD for sum0 writes to the same register as the first MPY p00.
The second ADD reads p00 and p01. At the beginning of each inner loop,
the first ADD is not performed, so the second ADD correctly reads the
results of the first two MPYs (p01 and p00) and adds them together. For
other iterations of the inner loop, the first ADD executes, and the second
ADD sums the second MPY result (p01) with the running accumulator. The
same is true for the first and second ADDs of sum1.



Example 5—76. Linear Assembly for FIR With Outer Loop Conditionally Executed 
With Inner Loop 


         LDW     *h++[2],h01        ; h[i+0] & h[i+1]
         LDW     *h_1++[2],h23      ; h[i+2] & h[i+3]
         LDW     *h++[2],h45        ; h[i+4] & h[i+5]
         LDW     *h_1++[2],h67      ; h[i+6] & h[i+7]
         LDW     *x++[2],x01        ; x[j+i+0] & x[j+i+1]
         LDW     *x_1++[2],x23      ; x[j+i+2] & x[j+i+3]
         LDW     *x++[2],x45        ; x[j+i+4] & x[j+i+5]
         LDW     *x_1++[2],x67      ; x[j+i+6] & x[j+i+7]
         LDH     *x,x8              ; x[j+i+8]
 [sctr]  SUB     sctr,1,sctr        ; dec store lp cntr
 [!sctr] SHR     sum07,15,y0        ; (sum0 >> 15)
 [!sctr] SHR     sum17,15,y1        ; (sum1 >> 15)
 [!sctr] STH     y0,*y++[2]         ; y[j] = (sum0 >> 15)
 [!sctr] STH     y1,*y_1++[2]       ; y[j+1] = (sum1 >> 15)
         MV      x01,x01b           ; move to other reg file
         MPYLH   h01,x01b,p10       ; p10 = h[i+0]*x[j+i+1]
 [sctr]  ADD     p10,sum17,p10      ; sum1(p10) = p10 + sum1
         MPYHL   h01,x23,p11        ; p11 = h[i+1]*x[j+i+2]
         ADD     p11,p10,sum11      ; sum1 += p11
         MPYLH   h23,x23,p12        ; p12 = h[i+2]*x[j+i+3]
         ADD     p12,sum11,sum12    ; sum1 += p12
         MPYHL   h23,x45,p13        ; p13 = h[i+3]*x[j+i+4]
         ADD     p13,sum12,sum13    ; sum1 += p13
         MPYLH   h45,x45,p14        ; p14 = h[i+4]*x[j+i+5]
         ADD     p14,sum13,sum14    ; sum1 += p14
         MPYHL   h45,x67,p15        ; p15 = h[i+5]*x[j+i+6]
         ADD     p15,sum14,sum15    ; sum1 += p15
         MPYLH   h67,x67,p16        ; p16 = h[i+6]*x[j+i+7]
         ADD     p16,sum15,sum16    ; sum1 += p16
         MPYHL   h67,x8,p17         ; p17 = h[i+7]*x[j+i+8]
         ADD     p17,sum16,sum17    ; sum1 += p17
         MPY     h01,x01,p00        ; p00 = h[i+0]*x[j+i+0]
 [sctr]  ADD     p00,sum07,p00      ; sum0(p00) = p00 + sum0
         MPYH    h01,x01,p01        ; p01 = h[i+1]*x[j+i+1]
         ADD     p01,p00,sum01      ; sum0 += p01
         MPY     h23,x23,p02        ; p02 = h[i+2]*x[j+i+2]
         ADD     p02,sum01,sum02    ; sum0 += p02
         MPYH    h23,x23,p03        ; p03 = h[i+3]*x[j+i+3]
         ADD     p03,sum02,sum03    ; sum0 += p03
         MPY     h45,x45,p04        ; p04 = h[i+4]*x[j+i+4]
         ADD     p04,sum03,sum04    ; sum0 += p04
         MPYH    h45,x45,p05        ; p05 = h[i+5]*x[j+i+5]
         ADD     p05,sum04,sum05    ; sum0 += p05
         MPY     h67,x67,p06        ; p06 = h[i+6]*x[j+i+6]
         ADD     p06,sum05,sum06    ; sum0 += p06
         MPYH    h67,x67,p07        ; p07 = h[i+7]*x[j+i+7]
         ADD     p07,sum06,sum07    ; sum0 += p07
 [!sctr] MVK     4,sctr             ; reset store lp cntr
 [pctr]  SUB     pctr,1,pctr        ; dec pointer reset lp cntr
 [!pctr] SUB     x,rstx2,x          ; reset x ptr
 [!pctr] SUB     x_1,rstx1,x_1      ; reset x_1 ptr
 [!pctr] SUB     h,rsth1,h          ; reset h ptr
 [!pctr] SUB     h_1,rsth2,h_1      ; reset h_1 ptr
 [!pctr] MVK     4,pctr             ; reset pointer reset lp cntr
 [octr]  SUB     octr,1,octr        ; dec outer lp cntr
 [octr]  B       LOOP               ; Branch outer loop
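
The four multiply variants used above select different 16-bit halves of their 32-bit operands. The following C sketch models that selection so the comments in the listing can be checked by hand. It is a behavioral illustration only: the helper names are hypothetical, and it assumes the little-endian case, in which the element at the lower address sits in the low half of a word loaded by LDW.

```c
#include <stdint.h>

/* Signed value of the low or high 16 bits of a 32-bit word. */
static int32_t lo16(uint32_t w)
{
    int32_t v = (int32_t)(w & 0xFFFFu);
    return v >= 0x8000 ? v - 0x10000 : v;
}

static int32_t hi16(uint32_t w)
{
    int32_t v = (int32_t)(w >> 16);
    return v >= 0x8000 ? v - 0x10000 : v;
}

/* Pack two 16-bit elements as a little-endian word load would see them:
   the element at the lower address goes in the low half. */
static uint32_t pack(int16_t low_addr_elem, int16_t high_addr_elem)
{
    return (uint32_t)(uint16_t)low_addr_elem |
           ((uint32_t)(uint16_t)high_addr_elem << 16);
}

/* MPY: low half of a times low half of b. */
static int32_t mpy(uint32_t a, uint32_t b)   { return lo16(a) * lo16(b); }

/* MPYH: high half of a times high half of b. */
static int32_t mpyh(uint32_t a, uint32_t b)  { return hi16(a) * hi16(b); }

/* MPYLH: low half of a times high half of b. */
static int32_t mpylh(uint32_t a, uint32_t b) { return lo16(a) * hi16(b); }

/* MPYHL: high half of a times low half of b. */
static int32_t mpyhl(uint32_t a, uint32_t b) { return hi16(a) * lo16(b); }
```

With h01 = pack(h[i], h[i+1]) and x01 = pack(x[j+i], x[j+i+1]), mpy(h01, x01) gives h[i]*x[j+i], mpyh(h01, x01) gives h[i+1]*x[j+i+1], and mpylh(h01, x01) gives h[i]*x[j+i+1], matching the p00, p01, and p10 comments in the listing.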


5.14.6 Translating C Code to Linear Assembly (Inner Loop and Outer Loop) 


Example 5-77 shows the linear assembly with functional units assigned. (As
in Example 5-68 on page 5-125, symbolic names now have an A or B in front
of them to signify the register file where they reside.) Although this allocation
is one of many possibilities, one goal is to keep the 1X and 2X paths to a
minimum. Even with this goal, you have five 2X paths and seven 1X paths.


One requirement that was assumed when the functional units were chosen 
was that all the sum0 values reside on the same side (A in this case) and all 
the sum1 values reside on the other side (B). Because you are scheduling 
eight accumulates for both sum0 and sum1 in an 8-cycle loop, each ADD must 
be scheduled immediately following the previous ADD. Therefore, it is undesir- 
able for any sum0 ADDs to use the same functional units as sum1 ADDs. 


One MV instruction was added to get x01 on the B side for the MPYLH p10
instruction.


Example 5-77. Linear Assembly for FIR With Outer Loop Conditionally Executed
With Inner Loop (With Functional Units) 


         .global _fir

_fir:    .cproc  x, h, y
         .reg    x_1, h_1, y_1, octr, pctr, sctr
         .reg    sum01, sum02, sum03, sum04, sum05, sum06, sum07
         .reg    sum11, sum12, sum13, sum14, sum15, sum16, sum17
         .reg    p00, p01, p02, p03, p04, p05, p06, p07
         .reg    p10, p11, p12, p13, p14, p15, p16, p17
         .reg    x01b, x01, x23, x45, x67, x8, h01, h23, h45, h67
         .reg    y0, y1, rstx1, rstx2, rsth1, rsth2

         ADD     x,4,x_1                ; point to x[2]
         ADD     h,4,h_1                ; point to h[2]
         ADD     y,2,y_1                ; point to y[1]
         MVK     60,rstx1               ; used to rst x pointer each outer loop
         MVK     60,rstx2               ; used to rst x pointer each outer loop
         MVK     64,rsth1               ; used to rst h pointer each outer loop
         MVK     64,rsth2               ; used to rst h pointer each outer loop
         MVK     201,octr               ; loop ctr = 201 = (100/2) * (32/8) + 1
         MVK     4,pctr                 ; pointer reset lp cntr = 32/8
         MVK     5,sctr                 ; reset store lp cntr = 32/8 + 1
         ZERO    sum07                  ; sum07 = 0
         ZERO    sum17                  ; sum17 = 0

         .mptr   x, x+0
         .mptr   x_1, x+4
         .mptr   h, h+0
         .mptr   h_1, h+4

LOOP:    .trip   8

         LDW     .D1T1   *h++[2],h01    ; h[i+0] & h[i+1]
         LDW     .D2T2   *h_1++[2],h23  ; h[i+2] & h[i+3]
         LDW     .D1T1   *h++[2],h45    ; h[i+4] & h[i+5]
         LDW     .D2T2   *h_1++[2],h67  ; h[i+6] & h[i+7]
         LDW     .D2T1   *x++[2],x01    ; x[j+i+0] & x[j+i+1]
         LDW     .D1T2   *x_1++[2],x23  ; x[j+i+2] & x[j+i+3]
         LDW     .D2T1   *x++[2],x45    ; x[j+i+4] & x[j+i+5]
         LDW     .D1T2   *x_1++[2],x67  ; x[j+i+6] & x[j+i+7]
         LDH     .D2T1   *x,x8          ; x[j+i+8]

 [sctr]  SUB     .S1     sctr,1,sctr    ; dec store lp cntr
 [!sctr] SHR     .S1     sum07,15,y0    ; (sum0 >> 15)
 [!sctr] SHR     .S2     sum17,15,y1    ; (sum1 >> 15)
 [!sctr] STH     .D1     y0,*y++[2]     ; y[j] = (sum0 >> 15)
 [!sctr] STH     .D2     y1,*y_1++[2]   ; y[j+1] = (sum1 >> 15)


[setr] 


[sctr] 


PYLH 


PYLH 


PYH 


PYH 


PYH 


.L2X 
-M2X 


x01,x01b 
h01,x01b, p10 


pl0,suml17,p10 


h01,x23,pl1l1 
pll1,pl10,suml11 


h23:, x23, pl2Z 
pl2,sumil1,suml12 


h23,x45,p13 
pl3,suml12,suml13 


h45,x45,pl14 
pl4,suml13,suml14 


h45,x67,p15 
pl5,sum14,sum15 


h67,x67,pl6 
pl6,sum15, suml16 


h67,x8,pl17 
pl7,sum16,suml17 


hO1,x01,p00 
p00, sum07, p00 


hO1,x01,p01 
p01,p00, sum01 


h23,x23,p02 
p02,sum01, sum02 


h23,%23,p03 
p03,sum02, sum03 


h45,x45,p04 
p04, sum03, sum04 


h45,x45,p05 
p05,sum04, sum05 


move to other reg file 
h[it+0]*x[Jj+it+1] 


plo = 


suml (p 


10) 


h[{itl]*x[ j+it+2 
+= pll 


h[it2]*x[j+it+3 
pl2 


h[i+3]*x[j+it+4 
+= p13 


h[it4]*x[ j+it5 
+= p14 
h{it+5]*x[j+it+6 


h{it6]*x[j+it7 


hfit+t7]*x[j+its 


= p10 + suml 


= pl7 
h[it+0]*x[j+it+0 
00) = poO + sum0 
hfitl]*x[jt+itl 
= pol 
h[it2]*x[j+it2 
= pod2 
h[it+3]*x[j+it+3 
= pd3 

h[it+4]*x[ 3+i+4 
= pod4 
h[it5]*x[jt+it5 
= pod 
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Example 5—77. Linear Assembly for FIR With Outer Loop Conditionally Executed 
With Inner Loop (With Functional Units)(Continued) 


         MPY     .M2     h67,x67,p06      ; p06 = h[i+6]*x[j+i+6]
         ADD     .L1X    p06,sum05,sum06  ; sum0 += p06
         MPYH    .M2     h67,x67,p07      ; p07 = h[i+7]*x[j+i+7]
         ADD     .L1X    p07,sum06,sum07  ; sum0 += p07
 [!sctr] MVK     .S1     4,sctr           ; reset store lp cntr
 [pctr]  SUB     .S1     pctr,1,pctr      ; dec pointer reset lp cntr
 [!pctr] SUB     .S2     x,rstx2,x        ; reset x ptr
 [!pctr] SUB     .S1     x_1,rstx1,x_1    ; reset x_1 ptr
 [!pctr] SUB     .S1     h,rsth1,h        ; reset h ptr
 [!pctr] SUB     .S2     h_1,rsth2,h_1    ; reset h_1 ptr
 [!pctr] MVK     .S1     4,pctr           ; reset pointer reset lp cntr
 [octr]  SUB     .S2     octr,1,octr      ; dec outer lp cntr
 [octr]  B       .S2     LOOP             ; Branch outer loop
         .endproc


5.14.7 Determining the Minimum Iteration Interval


Based on Table 5-27, the minimum iteration interval is 8. An iteration interval 
of 8 means that two multiply-accumulates per cycle are still executing. 


Table 5-27. Resource Table for FIR Filter Code


(a) A side

Unit(s)               Total/Unit
.M1                   8
.S1                   7
.D1                   5
.L1                   8
Total non-.M units    20
1X paths              7

(b) B side

Unit(s)               Total/Unit
.M2                   8
.S2                   6
.D2                   6
.L2                   8
Total non-.M units    20
2X paths              5


5.14.8 Final Assembly 


Example 5—78 shows the final assembly for the FIR filter with the outer loop 
conditionally executing in parallel with the inner loop. 




Example 5-78. Final Assembly Code for FIR Filter


[Listing: prolog of the final FIR filter assembly. It points registers at
h[0] through h[3], x[j] through x[j+3], and y[j+1]; sets the loop counter to
200 ((32/8)*(100/2)); sets the pointer-reset and store loop counters; zeroes
the initial accumulators; and issues the first loads and multiplies of the
software pipeline.]
[Listing, continued: the LOOP kernel. Each pass performs the multiplies
p00 through p07 and p10 through p17 and accumulates them into sum0 and sum1,
reloads h[] and x[], conditionally resets the pointers and the pointer-reset
and store loop counters, stores y[j] = (Asum0 >> 15) and
y[j+1] = (Bsum1 >> 15), decrements the outer loop counter, and branches back
to LOOP.]
          ADD    .L2X  A9,B8,B11       ; sum1 += p17
          ADD    .L1X  B11,A12,A12     ; sum0 += p06
          MPY    .M1   A8,A10,A7       ;* p00 = h[i+0]*x[j+i+0]
          MPYLH  .M2   B7,B9,B13       ;* p12 = h[i+2]*x[j+i+3]
   [A2]   SUB          A2,1,A2         ;* dec store lp cntr
          ADD    .L1X  B13,A12,A10     ; sum0 += p07
   [!A2]  SHR    .S2   B11,15,B11      ;* (Bsum1 >> 15)
          MPY    .M2   B7,B9,B9        ;* p02 = h[i+2]*x[j+i+2]
          MPYH   .M1   A8,A10,A10      ;* p01 = h[i+1]*x[j+i+1]
   [A2]   ADD    .L2   B4,B11,B4       ;* sum1(p10) = p10 + sum1
          LDW    .D1   *A4++[2],B9     ;** x[j+i+2] & x[j+i+3]
          LDW    .D2   *B1++[2],A10    ;** x[j+i+0] & x[j+i+1]
          ;Branch occurs here
   [!A2]  SHR    .S1   A10,15,A12      ; (Asum0 >> 15)
   [!A2]  STH    .D2   B11,*B6++[2]    ; y[j+1] = (Bsum1 >> 15)
|| [!A2]  STH    .D1   A12,*A6++[2]    ; y[j] = (Asum0 >> 15)


5.14.9 Comparing Performance 


The cycle count of this code is 1612: 50 (8 x 4+0) +12. The overhead due 
to the outer loop has been completely eliminated. 


Table 5-28. Comparison of FIR Filter Code


Code Example                                             Cycles                    Cycle Count
Example 5-61  FIR with redundant load elimination        50 (16 x 2 + 9 + 6) + 2   2352
Example 5-69  FIR with redundant load elimination and    50 (8 x 4 + 10 + 6) + 2   2402
              no memory hits
Example 5-71  FIR with redundant load elimination and    50 (7 x 4 + 6 + 6) + 6    2006
              no memory hits with outer loop
              software-pipelined
Example 5-74  FIR with redundant load elimination and    50 (8 x 4 + 0) + 12       1612
              no memory hits with outer loop
              conditionally executed with inner loop
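

The cycle expressions in Table 5-28 all share one shape, which the following
sketch evaluates. The function and parameter names are illustrative only (they
do not come from the tools), and reading the two inner factors as "cycles per
pass times number of passes" is an assumption; the numbers themselves are taken
from the table.

```c
#include <assert.h>

/* cycles = outer_iters * (f1 * f2 + per_outer_overhead) + fixed_overhead
   All names are hypothetical; only the numeric values come from Table 5-28. */
int fir_cycles(int outer_iters, int f1, int f2,
               int per_outer_overhead, int fixed_overhead)
{
    return outer_iters * (f1 * f2 + per_outer_overhead) + fixed_overhead;
}

/* Re-derives the Cycle Count column from the Cycles column. */
void check_table_5_28(void)
{
    assert(fir_cycles(50, 16, 2, 9 + 6, 2) == 2352);
    assert(fir_cycles(50, 8, 4, 10 + 6, 2) == 2402);
    assert(fir_cycles(50, 7, 4, 6 + 6, 6) == 2006);
    assert(fir_cycles(50, 8, 4, 0, 12) == 1612);
}
```

The last row has a per-outer-iteration overhead of 0, which is the numeric form
of the statement that the outer loop overhead has been eliminated.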
Chapter 6


’C64x Programming Considerations 


This chapter covers material specific to the TMS320C64x series of DSPs. It
builds on the material presented elsewhere in this book, with additional
information specific to the VelociTI.2 extensions that the ’C64x provides.


Before reading this chapter, familiarize yourself with the programming
concepts presented earlier for the entire C6000 family, as these concepts
also apply to the ’C64x.


The sample code that is used in this chapter is included on the Code Genera- 
tion Tools and Code Composer Studio CD-ROM. When you install your code 
generation tools, the example code is installed in the c6xtools directory. Use 
the code in that directory to go through the examples in this chapter. 
6.1 Overview of ’C64x Architectural Enhancements


The ’C64x is a fixed-point digital signal processor (DSP) and is the first DSP
to add VelociTI.2 extensions to the existing high-performance VelociTI
architecture. VelociTI.2 extensions provide the following features:


- Greater scheduling flexibility for existing instructions
- Greater memory bandwidth with double-word load and store instructions
- Support for packed 8-bit and 16-bit data types
- Support for non-aligned memory accesses
- Special purpose instructions for communications-oriented applications


6.1.1 Improved Scheduling Flexibility


The ’C64x improves scheduling flexibility using three different methods. First, 
it makes several existing instructions available on a larger number of units. 
Second, it adds cross-path access to the D-unit so that arithmetic and logical 
operations which use a cross-path may be scheduled there. Finally, it removes 
a number of scheduling restrictions associated with 40-bit operations, allowing
more flexible scheduling of high-precision code. 


6.1.2 Greater Memory Bandwidth 


The ’C64x provides double-word load and store instructions (LDDW and
STDW) which can access 64 bits of data at a time. Up to two double-word load
or store instructions can be issued every cycle. This provides a peak band- 
width of 128 bits per cycle to on-chip memory. 


6.1.3 Support for Packed Data Types 




The ’C64x builds on the ’C62x’s existing support for packed data types by im- 
proving support for packed signed 16-bit data and adding new support for 
packed unsigned 8-bit data. Packed data types are supported using new pack/ 
unpack, logical, arithmetic and multiply instructions for manipulating packed 
data. 


Packed data types store multiple pieces of data within a single 32-bit register. 
Pack and unpack instructions provide a method for reordering this packed 
data, and for converting between packed formats. Shift and merge instructions 
(SHLMB and SHRMB) also provide a means for reordering packed 8-bit data. 


New arithmetic instructions include standard addition, subtraction, and
comparison, as well as advanced operations such as minimum, maximum, and
average. New packed multiply instructions provide support for standard
multiplies as well as rounded multiplies and dot products. With packed data
types, a single instruction can operate on two 16-bit quantities or four 8-bit
quantities at once.


6.1.4 Non-aligned Memory Accesses 


In order to capitalize on its memory and processing bandwidth, the 'C64x pro- 
vides support for non-aligned memory accesses. Non-aligned memory ac- 
cesses provide a method for accessing packed data types without the restric- 
tions imposed by 32-bit or 64-bit alignment boundaries. The 'C64x can access 
up to 64 bits per cycle at any byte boundary with non-aligned load and store 
instructions (LDNW, LDNDW, STNW, and STNDW). 
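

On a host machine, the effect of a 64-bit access at an arbitrary byte address
can be imitated with memcpy. The sketch below is a portable model for
experimenting with data layouts, not ’C64x code; the function names are
hypothetical, and the only claim is that they read and write eight bytes at
any address, which is the behavior the text attributes to LDNDW and STNDW.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read 64 bits starting at any byte address (host-side model only). */
uint64_t load_u64_any(const unsigned char *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);   /* memcpy has no alignment requirement */
    return v;
}

/* Write 64 bits starting at any byte address (host-side model only). */
void store_u64_any(unsigned char *p, uint64_t v)
{
    memcpy(p, &v, sizeof v);
}

/* Copies eight bytes from an odd offset to another odd offset. */
void unaligned_copy_demo(void)
{
    unsigned char src[16], dst[16] = {0};
    int i;

    for (i = 0; i < 16; i++)
        src[i] = (unsigned char)(i * 17);
    store_u64_any(dst + 1, load_u64_any(src + 3));
    assert(memcmp(dst + 1, src + 3, 8) == 0);
}
```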


6.1.5 Additional Specialized Instructions 


The ’C64x also provides a number of new bit-manipulation and other special- 
ized instructions for improving performance on bit-oriented algorithms. These 
instructions are designed to improve performance on error correction, encryp- 
tion, and other bit-level algorithms. Instructions in this category include BITC4, 
BITR, ROTL, SHFL, and DEAL. See the TMS320C6000 CPU and Instruction 
Set User’s Guide for more details on these and related instructions. 
6.2 Accessing Packed-Data Processing on the ’C64x


6.2.1 Introduction to Packed Data Processing Techniques


Packed-data processing is a type of processing where a single instruction ap- 
plies the same operation to multiple independent pieces of data. For example, 
the ADD2 instruction performs two independent 16-bit additions between two 
pairs of 16-bit values. This produces a pair of 16-bit results. In this case, a 
single instruction, ADD2, is operating on multiple sets of data, the two indepen- 
dent pairs of addends. 
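

The independence of the two additions can be seen in a small portable model.
This is a host-side sketch under the assumption that ADD2 wraps each 16-bit
sum without saturation; add2_model is a hypothetical name, not the compiler
intrinsic.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a packed 16-bit add: each half is summed on its own, so a carry
   out of the low half never propagates into the high half. */
uint32_t add2_model(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;  /* shift discards the carry */
    return hi | lo;
}

void add2_demo(void)
{
    /* Low halves: 0xFFFF + 0x0001 wraps to 0x0000.
       High halves: 0x0001 + 0x0001 = 0x0002, unaffected by the wrap. */
    assert(add2_model(0x0001FFFFu, 0x00010001u) == 0x00020000u);
}
```

A plain 32-bit add of the same operands would give 0x00030000, because the
carry from the low half would spill into the high half.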


Packed-data processing is a powerful method for exploiting the inherent paral- 
lelism in signal processing and other calculation-intensive code, while retain- 
ing dense code. Many signal processing functions apply the same sets of op- 
erations to many elements of data. Generally, these operations are indepen- 
dent of each other. Packed-data processing allows the programmer to capital- 
ize on this by operating on multiple pieces of data with a single compact stream 
of instructions. This saves code size and dramatically boosts performance. 


The 'C64x provides a rich family of instructions which are designed to work 
with packed-data processing. At the core of this paradigm are packed data 
types, which are designed to store multiple data elements in a single machine 
register. Packed-data processing instructions are built around these data 
types to provide a flexible, powerful, programming environment. 


Note:

Although the ’C6000 family supports both big-endian and little-endian
operation, the examples and figures in this section focus on little-endian
operation only. The packed-data processing extensions that the ’C64x provides
operate in either big- or little-endian mode, and perform identically on
values stored in the register file. However, accesses to memory behave
differently in big-endian mode.


6.2.2 Packed Data Types 


Packed data types are the cornerstone of ’C64x packed-data processing sup- 
port. Each packed data type packs multiple elements into a single 32-bit gener- 
al purpose register. Table 6—1 below lists the packed data types that the ’C64x 
supports. The greatest amount of support is provided for unsigned 8-bit and 
signed 16-bit values. 


Table 6-1. Packed data types


Element Size   Signed/Unsigned   Elements in 32-bit word   Element type     Level of support
8 bits         unsigned          4                         unsigned char    high
16 bits        signed            2                         short            high
8 bits         signed            4                         char             limited
16 bits        unsigned          2                         unsigned short   limited


6.2.3 Storing Multiple Elements in a Single Register


Packed data types can be visualized as 8-bit or 16-bit partitions inside the larg- 
er 32-bit register. These partitions are merely logical partitions. As with all 
’C64x instructions, instructions which operate on packed data types operate 
directly on the 64 general purpose registers in the register file. There are no 
special packed data registers. How data in a register is interpreted is
determined entirely by the instruction that is using the data. Figure 6-1 and
Figure 6-2 illustrate how four bytes and two half-words are packed into a
single word.


Figure 6-1. Four Bytes Packed Into a Single General Purpose Register.


[Figure: four separate 8-bit values, Byte 3, Byte 2, Byte 1, and Byte 0, are
placed side by side in that order in one 32-bit general purpose register.]
Figure 6-2. Two Half-Words Packed Into a Single General Purpose Register.


[Figure: two separate 16-bit values, Halfword 1 and Halfword 0, are placed
side by side in that order in one 32-bit general purpose register.]


Notice that there is no distinction between signed or unsigned data made in 
Figure 6-1 and Figure 6-2. This is due to the fact that signed and unsigned 
data are packed identically within the register. This allows the instructions 
which are not sensitive to sign bits (such as adds and subtracts) to operate on 
signed and unsigned data equally. The distinction between signed and un- 
signed comes into play primarily when performing multiplication, comparing 
elements, and unpacking data (since either sign or zero extension must be 
performed). 


Table 6—2 provides a summary of the operations that the ’C64x provides on 
packed data types, and whether signed or unsigned types are supported. In- 
structions which were not specifically designed for packed data types can also 
be used to move, store, unpack, or repack data. 
Table 6-2. Supported Operations on Packed Data Types


                   Support for 8-bit     Support for 16-bit
Operation          Signed   Unsigned     Signed   Unsigned     Notes
ADD/SUB            Yes      Yes          Yes      Yes
Saturated ADD               Yes          Yes
Booleans           Yes      Yes          Yes      Yes          Uses generic boolean instructions
Shifts                                   Yes      Yes          Right-shift only
Multiply           *        Yes          Yes      *
Dot Product        *        Yes          Yes      *
Max/Min/Compare             Yes          Yes                   CMPEQ works with signed or unsigned
Pack               Yes      Yes          Yes      Yes
Unpack                      Yes          Yes      Yes          See Table 6-4 for 16-bit unpacks


* = Only 'signed-by-unsigned' support in these categories.


6.2.4 Packing and Unpacking Data 


The ’C64x provides a family of packing and unpacking instructions which are
used for converting between various packed and non-packed data types, as
well as for manipulating elements within a packed type. Table 6-3 lists the
available packing instructions and uses.
Table 6-3. Instructions for Manipulating Packed Data Types


Mnemonic   Intrinsic   Typical Uses With Packed Data

PACK2      _pack2      Packing 16-bit portions of 32-bit quantities.
PACKH2     _packh2     Rearranging packed 16-bit quantities.
PACKHL2    _packhl2    Rearranging pairs of 16-bit quantities.
PACKLH2    _packlh2

SPACK2     _spack2     Saturating 32-bit quantities down to signed 16-bit
                       values, packing them together.

SHR        (n/a)       Unpacking 16-bit values into 32-bit values.
SHRU       (n/a)
EXT        _ext
EXTU       _extu

PACKH4     _packh4     Unpacking 16-bit intermediate results into 8-bit
PACKL4     _packl4     final results.
                       De-interleaving packed 8-bit values.

UNPKHU4    _unpkhu4    Unpacking unsigned 8-bit data into 16 bits.
UNPKLU4    _unpklu4    Preparing 8-bit data to be interleaved.

SPACKU4    _spacku4    Saturating 16-bit quantities down to unsigned 8-bit
                       values, packing them together.

SHLMB      _shlmb      Rearranging packed 8-bit quantities.
SHRMB      _shrmb
SWAP4      _swap4
ROTL       _rotl


The _packXX2 group of intrinsics works by extracting selected half-words from
two 32-bit registers, returning the results packed into a 32-bit word. This is
primarily useful for manipulating packed 16-bit data, although these
intrinsics may also be used for manipulating pairs of packed 8-bit
quantities. Figure 6-3 illustrates the four _packXX2() intrinsics: _pack2(),
_packlh2(), _packhl2(), and _packh2(). (The l and the h in the name refer to
which half of each 32-bit input is being copied to the output, similar to how
the _mpyXX() intrinsics are named.)
Figure 6-3. Graphical Representation of _packXX2 Intrinsics
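

The half-word selection performed by each of the four intrinsics can be
modeled in portable C. This is a host-side sketch, not the compiler's
implementation: the *_model names are hypothetical, and the operand order
(the first argument supplies the upper half of the result) is an assumption
that should be checked against the instruction set guide.

```c
#include <assert.h>
#include <stdint.h>

/* low half of src1 : low half of src2 */
uint32_t pack2_model(uint32_t src1, uint32_t src2)
{
    return (src1 << 16) | (src2 & 0xFFFFu);
}

/* high half of src1 : high half of src2 */
uint32_t packh2_model(uint32_t src1, uint32_t src2)
{
    return (src1 & 0xFFFF0000u) | (src2 >> 16);
}

/* high half of src1 : low half of src2 */
uint32_t packhl2_model(uint32_t src1, uint32_t src2)
{
    return (src1 & 0xFFFF0000u) | (src2 & 0xFFFFu);
}

/* low half of src1 : high half of src2 */
uint32_t packlh2_model(uint32_t src1, uint32_t src2)
{
    return (src1 << 16) | (src2 >> 16);
}

void pack2_demo(void)
{
    uint32_t x = 0xAAAABBBBu, y = 0xCCCCDDDDu;

    assert(pack2_model(x, y)   == 0xBBBBDDDDu);
    assert(packh2_model(x, y)  == 0xAAAACCCCu);
    assert(packhl2_model(x, y) == 0xAAAADDDDu);
    assert(packlh2_model(x, y) == 0xBBBBCCCCu);
}
```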
The saturating pack intrinsic, _spack2, is closely related to the _pack2
intrinsic. The main difference is that the saturating pack first saturates
the signed 32-bit source values to signed 16-bit quantities, and then packs
these results into a single 32-bit word. This makes it easier to support
applications which generate some intermediate 32-bit results, but need a
signed 16-bit result at the end. Figure 6-4 shows _spack2’s operation
graphically.
Figure 6-4. Graphical Representation of _spack2
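

The saturate-then-pack behavior can be sketched in portable C as follows.
The names sat16 and spack2_model are hypothetical, and placing the first
argument in the upper half of the result is an assumption carried over from
_pack2.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a signed 32-bit value to the signed 16-bit range. */
int32_t sat16(int32_t x)
{
    if (x > 32767)
        return 32767;
    if (x < -32768)
        return -32768;
    return x;
}

/* Model of a saturating pack: saturate both inputs, then pack the two
   16-bit results into one 32-bit word. */
uint32_t spack2_model(int32_t src1, int32_t src2)
{
    uint32_t hi = (uint32_t)sat16(src1) & 0xFFFFu;
    uint32_t lo = (uint32_t)sat16(src2) & 0xFFFFu;
    return (hi << 16) | lo;
}

void spack2_demo(void)
{
    /* 100000 clamps to 0x7FFF and -100000 clamps to 0x8000. */
    assert(spack2_model(100000, -100000) == 0x7FFF8000u);
    /* In-range values pass through unchanged. */
    assert(spack2_model(-1, 5) == 0xFFFF0005u);
}
```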
Notice that there are no special unpack operations for 16-bit data. Instead, the
normal 32-bit right-shifts and extract operations can be used to unpack 16-bit 
elements into 32-bit quantities. Table 6—4 describes how to unpack signed and 
unsigned 16-bit quantities. 


Table 6-4. Unpacking Packed 16-bit Quantities to 32-bit Values


Type              Position     C code                          Assembly code
Signed 16-bit     Upper half   dst = ((signed)src) >> 16;      SHR  src, 16, dst
                  Lower half   dst = _ext(src, 16, 16);        EXT  src, 16, 16, dst
Unsigned 16-bit   Upper half   dst = ((unsigned)src) >> 16;    SHRU src, 16, dst
                  Lower half   dst = _extu(src, 16, 16);       EXTU src, 16, 16, dst
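

The four unpack cases differ only in which half is selected and whether the
result is sign-extended or zero-extended. The following host-side sketch
shows the same four results without relying on implementation-defined right
shifts of negative values; all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend a 16-bit field held in the range 0..0xFFFF. */
static int32_t sext16(uint32_t half)
{
    return (half & 0x8000u) ? (int32_t)half - 0x10000 : (int32_t)half;
}

int32_t  unpack_hi_s16(uint32_t src) { return sext16(src >> 16); }
int32_t  unpack_lo_s16(uint32_t src) { return sext16(src & 0xFFFFu); }
uint32_t unpack_hi_u16(uint32_t src) { return src >> 16; }
uint32_t unpack_lo_u16(uint32_t src) { return src & 0xFFFFu; }

void unpack16_demo(void)
{
    uint32_t src = 0x8001FFFEu;

    assert(unpack_hi_s16(src) == -32767);  /* 0x8001 as signed   */
    assert(unpack_lo_s16(src) == -2);      /* 0xFFFE as signed   */
    assert(unpack_hi_u16(src) == 32769u);  /* 0x8001 as unsigned */
    assert(unpack_lo_u16(src) == 65534u);  /* 0xFFFE as unsigned */
}
```

The same register contents yield different 32-bit values depending on the
extension chosen, which is why the table lists signed and unsigned cases
separately.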


For 8-bit data, the ’C64x provides the _packX4, _spacku4, and _unpkXu4
intrinsics for packing and unpacking data. The operation of these intrinsics
is illustrated in Figure 6-5 and Figure 6-6. These intrinsics are geared
around converting between 8-bit and 16-bit packed data types. To convert
between 32-bit and 8-bit values, an intermediate step at 16-bit precision is
required.
Figure 6-5. Graphical Representation of 8-bit Packs (_packX4 and _spacku4)


[Figure: three diagrams, each combining the four bytes of b (b_3 to b_0) and
the four bytes of a (a_3 to a_0) into a result c. Two show the byte selection
of the _packX4 intrinsics. The third shows c = _spacku4(b, a), in which the
four signed 16-bit halves of b and a are saturated to unsigned 8-bit values
and packed as sat(b_hi), sat(b_lo), sat(a_hi), sat(a_lo).]


Figure 6-6. Graphical Representation of 8-bit Unpacks (_unpkXu4)


[Figure: two diagrams, b = _unpkhu4(a); and b = _unpklu4(a);.]
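

The 8-bit packs and unpacks can be modeled on a host as shown below. This is
a sketch under stated assumptions, not the compiler's implementation: the
*_model names are hypothetical, and the byte selections (high or low byte of
each half-word, first argument toward the upper end of the result) follow the
instruction descriptions as understood here and should be confirmed against
the instruction set guide.

```c
#include <assert.h>
#include <stdint.h>

/* High byte of each half-word: b3 b1 a3 a1 */
uint32_t packh4_model(uint32_t b, uint32_t a)
{
    return (b & 0xFF000000u) | ((b & 0x0000FF00u) << 8)
         | ((a & 0xFF000000u) >> 16) | ((a & 0x0000FF00u) >> 8);
}

/* Low byte of each half-word: b2 b0 a2 a0 */
uint32_t packl4_model(uint32_t b, uint32_t a)
{
    return ((b & 0x00FF0000u) << 8) | ((b & 0x000000FFu) << 16)
         | ((a & 0x00FF0000u) >> 8) | (a & 0x000000FFu);
}

/* Zero-extend the two upper bytes (or two lower bytes) to 16 bits each. */
uint32_t unpkhu4_model(uint32_t a) { return ((a >> 24) << 16) | ((a >> 16) & 0xFFu); }
uint32_t unpklu4_model(uint32_t a) { return (((a >> 8) & 0xFFu) << 16) | (a & 0xFFu); }

/* Saturate a signed 16-bit field (held in 0..0xFFFF) to unsigned 8 bits. */
static uint32_t sat_u8(uint32_t half)
{
    if (half & 0x8000u)
        return 0u;                      /* negative values clamp to 0 */
    return half > 255u ? 255u : half;   /* large values clamp to 255  */
}

uint32_t spacku4_model(uint32_t b, uint32_t a)
{
    return (sat_u8(b >> 16) << 24) | (sat_u8(b & 0xFFFFu) << 16)
         | (sat_u8(a >> 16) << 8)  |  sat_u8(a & 0xFFFFu);
}

void pack4_demo(void)
{
    assert(packh4_model(0x11223344u, 0x55667788u) == 0x11335577u);
    assert(packl4_model(0x11223344u, 0x55667788u) == 0x22446688u);
    assert(unpkhu4_model(0x11223344u) == 0x00110022u);
    assert(unpklu4_model(0x11223344u) == 0x00330044u);
    /* halves: 256 -> 255, -1 -> 0, 127 -> 127, 128 -> 128 */
    assert(spacku4_model(0x0100FFFFu, 0x007F0080u) == 0xFF007F80u);
}
```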
The ’C64x also provides a handful of additional byte-manipulating operations
that have proven useful in various algorithms. These operations are neither
packs nor unpacks, but rather shuffle bytes within a word. Uses include
convolution-oriented algorithms, complex number arithmetic, and so on.
Operations in this category include the intrinsics _shlmb, _shrmb, _swap4,
and _rotl. The first three in this list are illustrated in Figure 6-7.
Figure 6-7. Graphical Representation of (_shlmb, _shrmb, and _swap4)


6.2.5 Optimizing for Packed Data Processing 


The ’C64x supports two basic forms of packed-data optimization, namely vec- 
torization and macro operations. 


Vectorization works by applying the exact same simple operations to several 
elements of data simultaneously. Kernels such as vector sum and vector multi- 
ply, shown in Example 6-1 and Example 6-2, exemplify this sort of computa- 
tion. 
Example 6-1. Vector Sum


void vec_sum(const short *restrict a, const short *restrict b,
             short *restrict c, int len)
{
    int i;

    for (i = 0; i < len; i++)
        c[i] = b[i] + a[i];
}

Example 6-2. Vector Multiply 


void vec_mpy(const short *restrict a, const short *restrict b,
             short *restrict c, int len, int shift)
{
    int i;

    for (i = 0; i < len; i++)
        c[i] = (b[i] * a[i]) >> shift;
}


This type of code is referred to as vector code because each of the input arrays 
is a vector of elements, and the same operation is being applied to each ele- 
ment. Pure vector code has no computation between adjacent elements when 
calculating results. Also, input and output arrays tend to have the same num- 
ber of elements. Figure 6-8 illustrates the general form of a simple vector op- 
eration that operates on inputs from arrays A and B, producing an output, C 
(such as our Vector Sum and Vector Multiply kernels above perform). 


Figure 6-8. Graphical Representation of a Simple Vector Operation 
Although pure vector algorithms exist, most applications do not consist purely
of vector operations as simple as the one shown above. More commonly, an
algorithm will have portions which behave as a vector algorithm and portions
which do not. The non-vector portions are addressed by other packed-data
processing techniques.


The second form of packed data optimization involves combining multiple op- 
erations on packed data into a single, larger operation referred to here as a 
macro operation. This can be very similar to vectorization, but the key differ- 
ence is that there is significant mathematical interaction between adjacent ele- 
ments. Simple examples include dot product operations and complex multi- 
plies, as shown in Example 6-3 and Example 6-4. 


Example 6-3. Dot Product 


int dot_prod(const short *restrict a, const short *restrict b, int len)
{
    int i;
    int sum = 0;

    for (i = 0; i < len; i++)
        sum += b[i] * a[i];

    return sum;
}

Example 6-4. Vector Complex Multiply


void vec_cx_mpy(const short *restrict a, const short *restrict b,
                short *restrict c, int len)
{
    int i, j;

    for (i = j = 0; i < len; i++, j += 2)
    {
        /* Real components are at even offsets, and imaginary components
           are at odd offsets within each array. */

        c[j+0] = (a[j+0] * b[j+0] - a[j+1] * b[j+1]) >> 16;
        c[j+1] = (a[j+0] * b[j+1] + a[j+1] * b[j+0]) >> 16;
    }
}

The data flow for the dot product is shown in Figure 6—9. Notice how this is sim- 
ilar to the vector sum in how the array elements are brought in, but different 
in how the final result is tabulated. 
Figure 6-9. Graphical Representation of Dot Product


[Figure: corresponding elements of Input A and Input B are multiplied, and
the products are summed into a single result.]


As you can see, this does not fit the pure vector model presented in
Figure 6-8. The Vector Complex Multiply also does not fit the pure vector
model, but for different reasons.


Mathematically, the vector complex multiply is a pure vector operation per- 
formed on vectors of complex numbers, as its name implies. However, it is not, 
in implementation, because neither the language type nor the hardware itself 
directly supports a complex multiply as a single operation. 


The complex multiply is built up from a number of real multiplies, with the com- 
plex numbers stored in arrays of interleaved real and imaginary values. As a 
result, the code requires a mix of vector and non—vector approaches to be opti- 
mized. Figure 6-10 illustrates the operations that are performed on a single 
iteration of the loop. As you can see, there is a lot going on in this loop. 
Figure 6-10. Graphical Representation of a Single Iteration of Vector Complex Multiply.


[Figure: data flow for one loop iteration. Array elements 2n and 2n+1 of
Input A and Input B, holding the real and imaginary components, are combined
to produce array elements 2n and 2n+1 of Output C.]


The following sections revisit these basic kernels and illustrate how single in- 
struction multiple data optimizations apply to each of these. 


6.2.6 Vectorizing With Packed Data Processing 


The most basic packed data optimization is to use wide memory accesses, in 
other words, word and double-word loads and stores, to access narrow data 
such as byte or half-word data. This is a simple form of vectorization, as de- 
scribed above, applied only to the array accesses. 


Widening memory accesses generally serves as a starting point for other vec- 
tor and packed data operations. This is due to the fact that the wide memory 
accesses tend to impose a packed data flow on the rest of the code around 
them. This type of optimization is said to work from the outside in, as loads and 
stores typically occur near the very beginning and end of the loop body. The
following examples use this outside-in approach to perform packed data opti- 
mization techniques on the example kernels. 


Note:

The following examples assume that the compiler has not performed any
packed data optimizations. The most recent release of the ’C6000 Code
Generation Tools will apply many of the optimizations described in this
chapter automatically when presented with sufficient information about the
code.


6.2.6.1 Vectorizing the Vector Sum 


Consider the vector sum kernel presented in Example 6-1. In its default form,
it reads one half-word from the a[ ] array, one half-word from the b[ ] array,
adds them, and writes a single half-word result to the c[ ] array. This results
in a 2-cycle loop that moves 48 bits per iteration, or 24 bits per cycle. When
you consider that the ’C64x can read or write 128 bits every cycle, it becomes
clear that this is very inefficient.


One simple optimization is to replace the half-word accesses with double-word
accesses to read and write array elements in groups of four. When doing this,
array elements are read into the register file with four elements packed into a
register pair. The array elements are packed two to a register, across two
registers. Each register contains data in the packed 16-bit data type
illustrated in Figure 6-2.


For the moment, assume that the arrays are double-word aligned, as shown 
in Example 6-5. For more information about methods for working with arrays 
that are not specially aligned, see section 6.2.8. The first thing to note is that 
the ‘C6000 Code Generation Tools lack a 64-bit integral type. This is not a 
problem, however, as the tools allow the use of double, and the intrinsics _lo(), 
_hi(), _itod() to access integer data with double-word loads. To account for the 
fact that the loop is operating on multiple elements at a time, the loop counter 
must be modified to count by fours instead of by single elements. 


The _amemd8 and _amemd8_const intrinsics tell the compiler to read the
array of shorts with double-word accesses. This causes LDDW and STDW
instructions to be issued for the array accesses. The _lo() and _hi()
intrinsics break apart a 64-bit double into its lower and upper 32-bit
halves. Each of these halves contains two 16-bit values packed in a 32-bit
word. To store the results, the _itod() intrinsic assembles 32-bit words back
into 64-bit doubles to be stored. Figure 6-11 and Figure 6-12 show these
processes graphically.


The adds themselves have not been addressed, so for now, the add is re- 
placed with a comment. 
Example 6-5. Vectorization: Using LDDW and STDW in Vector Sum


void vec_sum(const short *restrict a, const short *restrict b,
             short *restrict c, int len)
{
    int i;
    unsigned a3_a2, a1_a0;
    unsigned b3_b2, b1_b0;
    unsigned c3_c2, c1_c0;

    for (i = 0; i < len; i += 4)
    {
        a3_a2 = _hi(_amemd8_const(&a[i]));
        a1_a0 = _lo(_amemd8_const(&a[i]));

        b3_b2 = _hi(_amemd8_const(&b[i]));
        b1_b0 = _lo(_amemd8_const(&b[i]));

        /* ...somehow, the ADD occurs here,
           with results in c3_c2, c1_c0... */

        _amemd8(&c[i]) = _itod(c3_c2, c1_c0);
    }
}


Figure 6-11. Array Access in Vector Sum by LDDW


Figure 6-12. Array Access in Vector Sum by STDW
This code now efficiently reads and writes large amounts of data. The next step
is to find a method to quickly add them. The _add2() intrinsic provides just
that: it adds corresponding packed elements in two different words, producing
two packed sums. It provides exactly what is needed, a vector addition, as
Figure 6-13 illustrates.


Figure 6-13. Vector Addition


c_lo = _add2(b_lo, a_lo);

a_lo           a[1]                    a[0]
                +                       +
b_lo           b[1]                    b[0]

c_lo    c[1] = b[1] + a[1]      c[0] = b[0] + a[0]


So, putting in _add2() to perform the additions provides the complete code 
shown in Example 6-6. 
Example 6-6. Vector Addition (Complete)


void vec_sum(const short *restrict a, const short *restrict b,
             short *restrict c, int len)
{
    int i;
    unsigned a3_a2, a1_a0;
    unsigned b3_b2, b1_b0;
    unsigned c3_c2, c1_c0;

    for (i = 0; i < len; i += 4)
    {
        a3_a2 = _hi(_amemd8_const(&a[i]));
        a1_a0 = _lo(_amemd8_const(&a[i]));

        b3_b2 = _hi(_amemd8_const(&b[i]));
        b1_b0 = _lo(_amemd8_const(&b[i]));

        c3_c2 = _add2(b3_b2, a3_a2);
        c1_c0 = _add2(b1_b0, a1_a0);

        _amemd8(&c[i]) = _itod(c3_c2, c1_c0);
    }
}


At this point, the vector sum is fully vectorized, and can be optimized further 
using other traditional techniques such as loop unrolling and software pipelin- 
ing. These and other optimizations are described in detail throughout Chapter 
6. 
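

The structure of Example 6-6 can be tried on a host machine by substituting
portable stand-ins for the intrinsics. In the sketch below, load8, store8,
hi32, lo32, join, and add2_model are hypothetical helpers that imitate
_amemd8_const, _amemd8, _hi, _lo, _itod, and _add2; they are assumptions made
for illustration and say nothing about the code the compiler generates. As in
Example 6-6, len is assumed to be a multiple of 4.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side stand-ins for the double-word intrinsics. */
static uint64_t load8(const int16_t *p)          { uint64_t v; memcpy(&v, p, sizeof v); return v; }
static void     store8(int16_t *p, uint64_t v)   { memcpy(p, &v, sizeof v); }
static uint32_t hi32(uint64_t v)                 { return (uint32_t)(v >> 32); }
static uint32_t lo32(uint64_t v)                 { return (uint32_t)v; }
static uint64_t join(uint32_t hi, uint32_t lo)   { return ((uint64_t)hi << 32) | lo; }

/* Model of a packed 16-bit add: no carry crosses between the two halves. */
static uint32_t add2_model(uint32_t x, uint32_t y)
{
    return (((x >> 16) + (y >> 16)) << 16) | ((x + y) & 0xFFFFu);
}

/* Same loop structure as Example 6-6: four elements per iteration. */
void vec_sum_model(const int16_t *a, const int16_t *b, int16_t *c, int len)
{
    int i;

    for (i = 0; i < len; i += 4)
    {
        uint64_t a_dw  = load8(&a[i]);
        uint64_t b_dw  = load8(&b[i]);
        uint32_t c3_c2 = add2_model(hi32(b_dw), hi32(a_dw));
        uint32_t c1_c0 = add2_model(lo32(b_dw), lo32(a_dw));

        store8(&c[i], join(c3_c2, c1_c0));
    }
}

/* Compares the packed version against the scalar loop of Example 6-1. */
void vec_sum_demo(void)
{
    int16_t a[8] = {1, -2, 3, -4, 1000, -1000, 32767, -32768};
    int16_t b[8] = {10, 20, -30, 40, 234, -234, -1, 1};
    int16_t c[8];
    int i;

    vec_sum_model(a, b, c, 8);
    for (i = 0; i < 8; i++)
        assert(c[i] == (int16_t)(a[i] + b[i]));
}
```

The test values are chosen so that no sum leaves the 16-bit range, which
keeps the scalar reference free of implementation-defined conversions.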


6.2.6.2 Vectorizing the Vector Multiply 


The vector multiply shown in Figure 6-8 is similar to the vector sum, in that the
algorithm is a pure vector algorithm. One major difference is that the
intermediate values change precision. In the context of vectorization, this
changes the format the data is stored in, but it does not inhibit the ability
to vectorize the code.


The basic operation of vector multiply is to take two 16-bit elements, multiply 
them together to produce a 32-bit product, right-shift the 32-bit product to pro- 
duce a 16-bit result, and then to store this result. The entire process for a single 
iteration is shown graphically in Figure 6-14. 
Figure 6-14. Graphical Representation of a Single Iteration of Vector Multiply.


Notice that the values are still loaded and stored as 16-bit quantities.
Therefore, you should use the same basic flow as the vector sum. Example 6-7
shows this starting point. Figure 6-11 and Figure 6-12 also apply to this
example to illustrate how data is being accessed.
Example 6-7. Using LDDW and STDW in Vector Multiply


void vec_mpy(const short *restrict a, const short *restrict b,
             short *restrict c, int len, int shift)
{
    int i;
    unsigned a3_a2, a1_a0;
    unsigned b3_b2, b1_b0;
    unsigned c3_c2, c1_c0;

    for (i = 0; i < len; i += 4)
    {
        a3_a2 = _hi(_amemd8_const(&a[i]));
        a1_a0 = _lo(_amemd8_const(&a[i]));

        b3_b2 = _hi(_amemd8_const(&b[i]));
        b1_b0 = _lo(_amemd8_const(&b[i]));

        /* ...somehow, the Multiply and Shift occur here,
           with results in c3_c2, c1_c0... */

        _amemd8(&c[i]) = _itod(c3_c2, c1_c0);
    }
}


The next step is to perform the multiplication. The ’C64x intrinsic, _mpy2(), 
performs two 16 x 16 multiplies, providing two 32-bit results packed in a 64-bit 
double. This provides the multiplication. The _lo() and _hi() intrinsics allow 
separation of the two separate 32-bit products. Figure 6-15 illustrates how 
_mpy2() works. 


Figure 6-15. Packed 16x16 Multiplies Using _mpy2


                 16 bits          16 bits
a_lo              a[1]             a[0]            32-bit register
b_lo              b[1]             b[0]            32-bit register

c_lo_dbl       a[1] * b[1]      a[0] * b[0]        64-bit register pair
                 32 bits          32 bits


Once the 32-bit products are obtained, use standard 32-bit shifts to shift these 
to their final precision. However, this will leave the results in two separate 32-bit 
registers. 
The ’C64x provides the _pack family of intrinsics to convert the 32-bit
results into 16-bit results. The _packXX2() intrinsics, described in section
6.2.4, extract two 16-bit values from two 32-bit registers, returning the
results in a single 32-bit register. This allows efficient conversion of the
32-bit intermediate results to a packed 16-bit format.


In this case, after the right-shift, the affected bits will be in the lower half of the 
32-bit registers. Use the _pack2() intrinsic to convert the 32-bit intermediate 
values back to packed 16-bit results so they can be stored. The resulting C 
code is shown in Example 6-8. 
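Before looking at the kernel itself, it can help to model one packed word of
this data flow in portable C on a host machine. The sketch below is only an
illustration: the names model_mpy2(), model_pack2(), s16(), asr32(), and
model_vec_mpy_word() are hypothetical stand-ins written for this purpose, not
the compiler intrinsics, and they assume the halfword ordering described
above (the upper halfword holds the higher-numbered element).

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of v (stand-in for reading a signed halfword). */
static int32_t s16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Arithmetic right shift that avoids implementation-defined behavior
   for negative operands. */
static int32_t asr32(int32_t x, unsigned n)
{
    return (x < 0) ? ~(~x >> n) : (x >> n);
}

/* Hypothetical host model of _mpy2(): two signed 16x16 products. */
static void model_mpy2(uint32_t a, uint32_t b, int32_t *hi, int32_t *lo)
{
    *hi = s16(a >> 16) * s16(b >> 16);
    *lo = s16(a) * s16(b);
}

/* Hypothetical host model of _pack2(): the low halfword of each source,
   with the first source landing in the upper halfword of the result. */
static uint32_t model_pack2(uint32_t hi_src, uint32_t lo_src)
{
    return (hi_src << 16) | (lo_src & 0xFFFFu);
}

/* One packed word of the vector multiply: multiply, shift, repack. */
static uint32_t model_vec_mpy_word(uint32_t a, uint32_t b, unsigned shift)
{
    int32_t hi, lo;

    assert(shift < 32);
    model_mpy2(a, b, &hi, &lo);
    return model_pack2((uint32_t)asr32(hi, shift),
                       (uint32_t)asr32(lo, shift));
}
```

With a = 0x00030002 and b = 0x00040005 and a shift of 1, the two products 12
and 10 become 6 and 5, which repack to 0x00060005.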


Example 6-8. Using _mpy2() and _pack2() to Perform the Vector Multiply


void vec_mpy1(const short *restrict a, const short *restrict b,
              short *restrict c, int len, int shift)
{
    int i;
    unsigned a3_a2, a1_a0;            /* Packed 16-bit values         */
    unsigned b3_b2, b1_b0;            /* Packed 16-bit values         */
    double   c3_c2_dbl, c1_c0_dbl;    /* 32-bit prod in 64-bit pairs  */
    int      c3, c2, c1, c0;          /* Separate 32-bit products     */
    unsigned c3_c2, c1_c0;            /* Packed 16-bit values         */

    for (i = 0; i < len; i += 4)
    {
        a3_a2 = _hi(_amemd8_const(&a[i]));
        a1_a0 = _lo(_amemd8_const(&a[i]));

        b3_b2 = _hi(_amemd8_const(&b[i]));
        b1_b0 = _lo(_amemd8_const(&b[i]));

        /* Multiply elements together, producing four products */
        c3_c2_dbl = _mpy2(a3_a2, b3_b2);
        c1_c0_dbl = _mpy2(a1_a0, b1_b0);

        /* Shift each of the four products right by our shift amount */
        c3 = _hi(c3_c2_dbl) >> shift;
        c2 = _lo(c3_c2_dbl) >> shift;
        c1 = _hi(c1_c0_dbl) >> shift;
        c0 = _lo(c1_c0_dbl) >> shift;

        /* Pack the results back together into packed 16-bit format */
        c3_c2 = _pack2(c3, c2);
        c1_c0 = _pack2(c1, c0);

        /* Store the results. */
        _amemd8(&c[i]) = _itod(c3_c2, c1_c0);
    }
}
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This code works, but it is heavily bottlenecked on shifts. One way to eliminate 
this bottleneck is to use the packed 16-bit shift intrinsic, _shr2(). This can be 
done without losing precision, under the following conditions: 


- If the shift amount is known to be greater than or equal to 16, use
  _packh2() instead of _pack2() before the shift. If the shift amount is
  exactly 16, eliminate the shift. The _packh2 effectively performs part of
  the shift, shifting right by 16, so that the job can be finished with a
  _shr2() intrinsic. Figure 6-16 illustrates how this works.

- If the shift amount is less than 16, only use the _shr2() intrinsic if the
  32-bit products can be safely truncated to 16 bits first without losing
  significant digits. In this case, use the _pack2() intrinsic, but the bits
  above bit 15 are lost in the product. This is safe only if those bits are
  redundant (sign bits). Figure 6-17 illustrates this case.
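The first condition can be checked on a host machine with a small portable C
model. The sketch below is an illustration under stated assumptions: the
model_pack2(), model_packh2(), and model_shr2() functions are hypothetical
stand-ins for the intrinsics of the same name, written from the behavior
described in this section, and the helper names are invented for this
example. It compares the original path (two 32-bit shifts, then _pack2) with
the packed path (_packh2, then one _shr2 by shift - 16).

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of v. */
static int32_t s16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Arithmetic right shift without implementation-defined behavior. */
static int32_t asr32(int32_t x, unsigned n)
{
    return (x < 0) ? ~(~x >> n) : (x >> n);
}

/* Model of _pack2(): low halfword of each source. */
static uint32_t model_pack2(uint32_t hi_src, uint32_t lo_src)
{
    return (hi_src << 16) | (lo_src & 0xFFFFu);
}

/* Model of _packh2(): upper halfword of each source. */
static uint32_t model_packh2(uint32_t hi_src, uint32_t lo_src)
{
    return (hi_src & 0xFFFF0000u) | (lo_src >> 16);
}

/* Model of _shr2(): arithmetic right shift of each signed halfword. */
static uint32_t model_shr2(uint32_t x, unsigned n)
{
    uint32_t hi = (uint32_t)asr32(s16(x >> 16), n) & 0xFFFFu;
    uint32_t lo = (uint32_t)asr32(s16(x), n) & 0xFFFFu;

    return (hi << 16) | lo;
}

/* Reference path: shift each 32-bit product, then pack the low halves. */
static uint32_t shift_then_pack(int32_t p1, int32_t p0, unsigned shift)
{
    return model_pack2((uint32_t)asr32(p1, shift),
                       (uint32_t)asr32(p0, shift));
}

/* Packed path, valid only for shift >= 16: pack the upper halves first,
   then finish the job with a single packed 16-bit shift. */
static uint32_t pack_then_shift(int32_t p1, int32_t p0, unsigned shift)
{
    assert(shift >= 16 && shift < 32);
    return model_shr2(model_packh2((uint32_t)p1, (uint32_t)p0), shift - 16);
}
```

For any pair of 32-bit products and any shift from 16 through 31, both paths
should return the same packed word, because the upper halfword of a product
already equals that product shifted right by 16.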
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Figure 6—16. Fine Tuning Vector Multiply (shift > 16) 
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Figure 6—17. Fine Tuning Vector Multiply (shift < 16) 
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Whether or not the 16-bit shift version is used, consider the vector multiply to 
be fully optimized from a packed data processing standpoint. It can be further 
optimized using the more general techniques such as loop-unrolling and soft- 
ware pipelining that are discussed in Chapter 6. 
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6.2.7. Combining Multiple Operations in a Single Instruction 


The Dot Product and Vector Complex Multiply examples presented earlier are
both kernels that benefit from macro operations, that is, instructions that
perform more than one simple operation.


The ’C64x provides a number of instructions that combine common operations.
These instructions reduce the overall instruction count in the code, thereby
reducing codesize and increasing code density. They also tend to simplify
programming. Some of the more commonly used macro operations are listed in
Table 6-5.


Table 6-5. Intrinsics Which Combine Multiple Operations in One Instruction

Intrinsic     Instruction   Operations combined

_dotp2        DOTP2         Performs two 16x16 multiplies and adds the
                            products together.

_dotpn2       DOTPN2        Performs two 16x16 multiplies and subtracts the
                            second product from the first.

_dotprsu2     DOTPRSU2      Performs two 16x16 multiplies, adds the products
                            together, and shifts/rounds the sum.

_dotpnrsu2    DOTPNRSU2     Performs two 16x16 multiplies, subtracts the
                            second product from the first, and shifts/rounds
                            the difference.

_dotpu4       DOTPU4        Performs four 8x8 multiplies and adds the
_dotpsu4      DOTPSU4       products together.

_max2         MAX2          Compares two pairs of numbers, and selects the
_min2         MIN2          larger/smaller in each pair.

_maxu4        MAXU4         Compares four pairs of numbers, and selects the
_minu4        MINU4         larger/smaller in each pair.

_avg2         AVG2          Performs two 16-bit additions, followed by a
                            right shift by 1 with round.

_avgu4        AVGU4         Performs four 8-bit additions, followed by a
                            right shift by 1 with round.

_subabs4      SUBABS4       Finds the absolute value of the difference
                            between four pairs of 8-bit numbers.

As you can see, these macro operations can replace a number of separate
instructions rather easily. For instance, each _dotp2 eliminates an add, and
each _dotpu4 eliminates three adds. The following sections describe how to
write the Dot Product and Vector Complex Multiply examples to take advantage
of these.


6.2.7.1. Combining Operations in the Dot Product Kernel 


The Dot Product kernel, presented in Example 6-3, is one which benefits both 
from vectorization as well as macro operations. First, apply the vectorization 
optimization as presented earlier, and then look at combining operations to fur- 
ther improve the code. 


Vectorization can be performed on the array reads and multiplies that are in
this kernel, as described in section 6.2.3. The result of those steps is the
intermediate code shown in Example 6-9.


Example 6-9. Vectorized Form of the Dot Product Kernel 


int dot_prod(const short *restrict a, const short *restrict b,
             short *restrict c, int len)
{
    int i;
    unsigned a3_a2, a1_a0;            /* Packed 16-bit values         */
    unsigned b3_b2, b1_b0;            /* Packed 16-bit values         */
    double   c3_c2_dbl, c1_c0_dbl;    /* 32-bit prod in 64-bit pairs  */
    int      sum = 0;                 /* Sum to return from dot_prod  */

    for (i = 0; i < len; i += 4)
    {
        a3_a2 = _hi(_amemd8_const(&a[i]));
        a1_a0 = _lo(_amemd8_const(&a[i]));

        b3_b2 = _hi(_amemd8_const(&b[i]));
        b1_b0 = _lo(_amemd8_const(&b[i]));

        /* Multiply elements together, producing four products */
        c3_c2_dbl = _mpy2(a3_a2, b3_b2);
        c1_c0_dbl = _mpy2(a1_a0, b1_b0);

        /* Add the four products to our running sum. */
        sum += _hi(c3_c2_dbl);
        sum += _lo(c3_c2_dbl);
        sum += _hi(c1_c0_dbl);
        sum += _lo(c1_c0_dbl);
    }

    return sum;
}
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While this code is fully vectorized, it still can be improved. The kernel itself is 
performing two LDDWs, two MPY2, four ADDs, and one Branch. Because of 
the large number of ADDs, the loop cannot fit in a single cycle, and so the ’C64x 
datapath is not used efficiently. 


The way to improve this is to combine some of the multiplies with some of the 
adds. The ’C64x family of _dotp intrinsics provides the answer here. 
Figure 6—18 illustrates how the _dotp2 intrinsic operates. Other _dotp intrin- 
sics operate similarly. 


Figure 6—18. Graphical Representation of the _dotp2 Intrinsic c = _dotp2(b, a) 
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This operation exactly maps to the operation the dot product kernel performs. 
The modified version of the kernel absorbs two of the four ADDs into _dotp2
intrinsics. The result is shown as Example 6-10. Notice that the variable c
has been eliminated by summing the results of the _dotp2 intrinsic directly.
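To make the semantics concrete, the operation can be modeled in portable C on
a host machine. This is a sketch only: model_dotp2(), pack_pair(), and
dot_prod_model() are hypothetical names invented for this illustration, and
the model assumes that the higher-numbered array element sits in the upper
halfword of each packed word. It also ignores the corner case in which both
products equal 0x40000000, where the 32-bit sum would overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of v. */
static int32_t s16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Hypothetical host model of _dotp2(): multiply the upper halfwords,
   multiply the lower halfwords, and add the two products. */
static int32_t model_dotp2(uint32_t a, uint32_t b)
{
    return s16(a >> 16) * s16(b >> 16) + s16(a) * s16(b);
}

/* Build a packed word from two adjacent 16-bit elements. */
static uint32_t pack_pair(short hi, short lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Dot product computed two elements at a time through the model. */
static int dot_prod_model(const short *a, const short *b, int len)
{
    int i;
    int sum = 0;

    assert(len % 2 == 0);   /* the model consumes elements in pairs */
    for (i = 0; i < len; i += 2)
        sum += model_dotp2(pack_pair(a[i + 1], a[i]),
                           pack_pair(b[i + 1], b[i]));
    return sum;
}
```

Each call to the model replaces two multiplies and one add, which is the
saving the text attributes to _dotp2.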
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Example 6—10. Vectorized Form of the Dot Product Kernel 


int dot_prod(const short *restrict a, const short *restrict b,
             short *restrict c, int len)
{
    int i;
    unsigned a3_a2, a1_a0;    /* Packed 16-bit values         */
    unsigned b3_b2, b1_b0;    /* Packed 16-bit values         */
    int      sum = 0;         /* Sum to return from dot_prod  */

    for (i = 0; i < len; i += 4)
    {
        a3_a2 = _hi(_amemd8_const(&a[i]));
        a1_a0 = _lo(_amemd8_const(&a[i]));

        b3_b2 = _hi(_amemd8_const(&b[i]));
        b1_b0 = _lo(_amemd8_const(&b[i]));

        /* Perform dot-products on pairs of elements, totalling the
           results in the accumulator. */
        sum += _dotp2(a3_a2, b3_b2);
        sum += _dotp2(a1_a0, b1_b0);
    }

    return sum;
}


At this point, the code takes full advantage of the new features that the ’C64x 
provides. In the particular case of this kernel, no further optimization should 
be necessary. The tools produce an optimal single cycle loop, using the com- 
piler version that was available at the time this book was written. 


Example 6-11. Final Assembly Code for Dot—Product Kernel’s Inner Loop 


        SUB     B0,1,B0         ;
        ADD     B8,B7,B7        ; |10|
        ADD     A7,A6,A6        ; |10|
        DOTP2   B5,A5,B8        ; @@@@|10|
        DOTP2   B4,A4,A7        ; @@@@|10|
        BDEC    L2,A0           ; @@@@@
        LDDW    *A3++,A5:A4     ; @@@@@@@@@|10|
        LDDW    *B6++,B5:B4     ; @@@@@@@@@|10|
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6.2.7.2. Combining Operations in the Vector Complex Multiply Kernel 


The Vector Complex Multiply kernel that was originally shown in Example 6-4
can be optimized with a technique similar to the one used with the Dot
Product kernel in Section 6.2.7.1. First, the loads and stores are vectorized
in order to bring data in more efficiently. Next, operations are combined
together into intrinsics to make full use of the machine.


Example 6—12 illustrates the vectorization step. For details, consult the earlier 
examples, such as the Vector Sum. The complex multiplication step itself has 
not yet been optimized at all. 
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Example 6—12. Vectorized form of the Vector Complex Multiply Kernel 


void vec_cx_mpy(const short *restrict a, const short *restrict b,
                short *restrict c, int len, int shift)
{
    int i;
    unsigned a3_a2, a1_a0;    /* Packed 16-bit values      */
    unsigned b3_b2, b1_b0;    /* Packed 16-bit values      */
    short    a3, a2, a1, a0;  /* Separate 16-bit elements  */
    short    b3, b2, b1, b0;  /* Separate 16-bit elements  */
    short    c3, c2, c1, c0;  /* Separate 16-bit results   */
    unsigned c3_c2, c1_c0;    /* Packed 16-bit values      */

    for (i = 0; i < len; i += 4)
    {
        /* Load two complex numbers from the a[] array.               */
        /* The complex values loaded are represented as 'a3 + a2 * j' */
        /* and 'a1 + a0 * j'.  That is, the real components are a3    */
        /* and a1, and the imaginary components are a2 and a0.        */
        a3_a2 = _hi(_amemd8_const(&a[i]));
        a1_a0 = _lo(_amemd8_const(&a[i]));

        /* Load two complex numbers from the b[] array.               */
        b3_b2 = _hi(_amemd8_const(&b[i]));
        b1_b0 = _lo(_amemd8_const(&b[i]));

        /* Separate the 16-bit coefficients so that the complex       */
        /* multiply may be performed.  This portion needs further     */
        /* optimization.                                              */
        a3 = ((signed) a3_a2) >> 16;
        a2 = _ext(a3_a2, 16, 16);
        a1 = ((signed) a1_a0) >> 16;
        a0 = _ext(a1_a0, 16, 16);

        b3 = ((signed) b3_b2) >> 16;
        b2 = _ext(b3_b2, 16, 16);
        b1 = ((signed) b1_b0) >> 16;
        b0 = _ext(b1_b0, 16, 16);

        /* Perform the complex multiplies using 16x16 multiplies.     */
        c3 = (b3 * a2 + b2 * a3) >> 16;
        c2 = (b3 * a3 - b2 * a2) >> 16;

        c1 = (b1 * a0 + b0 * a1) >> 16;
        c0 = (b1 * a1 - b0 * a0) >> 16;

        /* Pack the 16-bit results into 32-bit words.                 */
        c3_c2 = _pack2(c3, c2);
        c1_c0 = _pack2(c1, c0);

        /* Store the results. */
        _amemd8(&c[i]) = _itod(c3_c2, c1_c0);
    }
}
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Example 6-12 still performs the complex multiply as a series of discrete steps 
once the individual elements are loaded. The next optimization step is to
combine some of the multiplies and adds/subtracts into _dotp and _dotpn
intrinsics in a similar manner to the Dot Product example presented earlier.


The real component of each result is the difference between the product of
the real components of both inputs and the product of the imaginary
components of both inputs. Because the real and imaginary components for each
input array are laid out the same, the _dotpn2 intrinsic can be used to
calculate the real component of the output. Figure 6-19 shows how this flow
would work.


Figure 6—19. The _dotpn2 Intrinsic Performing Real Portion of Complex Multiply. 
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The calculation for the result’s imaginary component provides a different prob- 
lem. As with the real component, the result is calculated from two products that 
are added together. A problem arises, though, because it is necessary to multi- 
ply the real component of one input with the imaginary component of the other 
input, and vice versa. None of the ’C64x intrinsics provide that operation direct- 
ly given the way the data is currently packed. 
The solution is to reorder the halfwords from one of the inputs, so that the
imaginary component is in the upper halfword and the real component is in the
lower halfword. This is accomplished by using the _packlh2 intrinsic to
reorder the halves of the word. Once the halfwords are reordered on one of
the inputs, the _dotp2 intrinsic provides the appropriate combination of
multiplies with an add to provide the imaginary component of the output.


Figure 6-20. _packlh2 and _dotp2 Working Together

[Figure: a' = _packlh2(a, a) swaps the real and imaginary halfwords of a.
c = _dotp2(b, _packlh2(a, a)) then forms the 32-bit sum
a_imaginary * b_real + a_real * b_imaginary.]


Once both the real and imaginary components of the result are calculated, it 
is necessary to convert the 32-bit results to 16-bit results and store them. In 
the original code, the 32-bit results were shifted right by 16 to convert them to 
16-bit results. These results were then packed together with _pack2 for stor- 
ing. Our final optimization replaces this shift and pack with a single _packh2. 
Example 6-13 shows the result of these optimizations. 
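The three steps described above (difference of products for the real part,
halfword swap plus sum of products for the imaginary part, and packing of the
upper halves) can be traced in a portable C model. The sketch below is an
illustration only: the model_*() functions and cx_mpy_word() are hypothetical
host-side stand-ins for the intrinsics, and the model assumes the layout used
in this section, with the real component in the upper halfword and the
imaginary component in the lower halfword.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the low 16 bits of v. */
static int32_t s16(uint32_t v)
{
    v &= 0xFFFFu;
    return (v & 0x8000u) ? (int32_t)v - 0x10000 : (int32_t)v;
}

/* Model of _dotp2(): upper product plus lower product. */
static int32_t model_dotp2(uint32_t a, uint32_t b)
{
    return s16(a >> 16) * s16(b >> 16) + s16(a) * s16(b);
}

/* Model of _dotpn2(): upper product minus lower product. */
static int32_t model_dotpn2(uint32_t a, uint32_t b)
{
    return s16(a >> 16) * s16(b >> 16) - s16(a) * s16(b);
}

/* Model of _packlh2(): low half of src1 on top, high half of src2 below.
   Called with the same word twice, it swaps the two halfwords. */
static uint32_t model_packlh2(uint32_t src1, uint32_t src2)
{
    return (src1 << 16) | (src2 >> 16);
}

/* Model of _packh2(): upper halfword of each source. */
static uint32_t model_packh2(uint32_t src1, uint32_t src2)
{
    return (src1 & 0xFFFF0000u) | (src2 >> 16);
}

/* One complex multiply on packed real:imaginary words. */
static uint32_t cx_mpy_word(uint32_t a, uint32_t b)
{
    int32_t re = model_dotpn2(b, a);                   /* br*ar - bi*ai */
    int32_t im = model_dotp2(b, model_packlh2(a, a));  /* br*ai + bi*ar */

    /* Keeping only the upper halves replaces the shift by 16 and _pack2. */
    return model_packh2((uint32_t)re, (uint32_t)im);
}
```

For a = 0x40002000 and b = 0x20001000, the real part is 0x06000000 and the
imaginary part is 0x08000000, so the packed result is 0x06000800.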
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Example 6—13. Vectorized form of the Vector Complex Multiply 


void vec_cx_mpy(const short *restrict a, const short *restrict b,
                short *restrict c, int len, int shift)
{
    int i;
    unsigned a3_a2, a1_a0;    /* Packed 16-bit values      */
    unsigned b3_b2, b1_b0;    /* Packed 16-bit values      */
    int      c3, c2, c1, c0;  /* Separate 32-bit results   */
    unsigned c3_c2, c1_c0;    /* Packed 16-bit values      */

    for (i = 0; i < len; i += 4)
    {
        /* Load two complex numbers from the a[] array.               */
        /* The complex values loaded are represented as 'a3 + a2 * j' */
        /* and 'a1 + a0 * j'.  That is, the real components are a3    */
        /* and a1, and the imaginary components are a2 and a0.        */
        a3_a2 = _hi(_amemd8_const(&a[i]));
        a1_a0 = _lo(_amemd8_const(&a[i]));

        /* Load two complex numbers from the b[] array.               */
        b3_b2 = _hi(_amemd8_const(&b[i]));
        b1_b0 = _lo(_amemd8_const(&b[i]));

        /* Perform the complex multiplies using _dotp2/_dotpn2.       */
        c3 = _dotpn2(b3_b2, a3_a2);                    /* Real        */
        c2 = _dotp2 (b3_b2, _packlh2(a3_a2, a3_a2));   /* Imaginary   */

        c1 = _dotpn2(b1_b0, a1_a0);                    /* Real        */
        c0 = _dotp2 (b1_b0, _packlh2(a1_a0, a1_a0));   /* Imaginary   */

        /* Pack the 16-bit results from the upper halves of the       */
        /* 32-bit results into 32-bit words.                          */
        c3_c2 = _packh2(c3, c2);
        c1_c0 = _packh2(c1, c0);

        /* Store the results. */
        _amemd8(&c[i]) = _itod(c3_c2, c1_c0);
    }
}


As with the earlier examples, this kernel now takes full advantage of the 
packed data processing features that the ’C64x provides. More general opti- 
mizations can be performed as described in Chapter 6 to further optimize this 
code. 
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6.2.8 Non-Aligned Memory Accesses 


In addition to traditional aligned memory access methods, the ’C64x also pro- 
vides intrinsics for non-aligned memory accesses. Aligned memory accesses 
are restricted to an alignment boundary that is determined by the amount of 
data being accessed. For instance, a 64-bit load must read the data from a 
location at a 64-bit boundary. Non-aligned access intrinsics relax this restric- 
tion, and can access data at any byte boundary. 
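On a host machine, the effect of a non-aligned access can be imitated in
portable C by copying bytes rather than dereferencing a wider pointer. The
sketch below is only an analogy for experimentation: load_u32_unaligned() and
store_u32_unaligned() are hypothetical helper names, not ’C64x intrinsics,
and memcpy() is used because it carries no alignment requirement.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read four bytes from any byte address, with no alignment requirement. */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;

    assert(p != 0);
    memcpy(&v, p, sizeof v);
    return v;
}

/* Write four bytes to any byte address, with no alignment requirement. */
static void store_u32_unaligned(void *p, uint32_t v)
{
    assert(p != 0);
    memcpy(p, &v, sizeof v);
}
```

A value stored at an odd offset, such as store_u32_unaligned(buf + 1, v),
reads back unchanged through load_u32_unaligned(buf + 1), and the bytes on
either side of the four-byte span are left untouched.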


There are a number of tradeoffs between aligned and non-aligned access 
methods. Table 6-6 lists the differences between both methods. 


Table 6-6. Comparison Between Aligned and Non-Aligned Memory Accesses


Aligned                                  Non-Aligned

Data must be aligned on a boundary       Data may be aligned on any byte
equal to its width.                      boundary.

Can read or write bytes, half-words,     Can only read or write words and
words, and double-words.                 double-words.

Up to two accesses may be issued per     Only one non-aligned access may be
cycle, for a peak bandwidth of 128       issued per cycle, for a peak bandwidth
bits/cycle.                              of 64 bits/cycle.

Bank conflicts may occur.                No bank conflict possible, because no
                                         other memory access may occur in
                                         parallel.


Because the ’C64x can only issue one non-aligned memory access per cycle, 
programs should focus on using aligned memory accesses whenever pos- 
sible. However, certain classes of algorithms are difficult or impossible to fit 
into this mold when applying packed-data optimizations. For example, con- 
volution-style algorithms such as filters fall in this category, particularly when 
the outer loop cannot be unrolled to process multiple outputs at one time. 
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6.2.8.1 Using Non-Aligned Memory Access Intrinsics 


Non-aligned memory accesses are generated using the _memXX() and
_memXX_const() intrinsics. These intrinsics generate a non-aligned reference
which may be read or written to, much like an array reference.
Example 6-14 below illustrates reading and writing via these intrinsics.


Example 6-14. Non-aligned Memory Access With _mem4 and _memd8


char a[1000];              /* Sample array */
double d;
const short cs[1000];

/* Store two bytes at a[69] and a[70] */
_mem2(&a[69]) = 0x1234;

/* Store four bytes at a[9] through a[12] */
_mem4(&a[9]) = 0x12345678;

/* Load eight bytes from a[115] through a[122] */
d = _memd8(&a[115]);

/* Load four shorts from cs[42] through cs[45] */
d = _memd8_const(&cs[42]);


It is easy to modify code to use non-aligned accesses. Example 6-15 below 
shows the Vector Sum from Example 6-6 rewritten to use non-aligned 
memory accesses. As with ordinary array references, the compiler will opti- 
mize away the redundant references. 
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Example 6—15. Vector Sum Modified to use Non-Aligned Memory Accesses 


void vec_sum(const short *restrict a, const short *restrict b,
             short *restrict c, int len)
{
    int i;
    unsigned a3_a2, a1_a0;
    unsigned b3_b2, b1_b0;
    unsigned c3_c2, c1_c0;

    for (i = 0; i < len; i += 4)
    {
        a3_a2 = _hi(_memd8_const(&a[i]));
        a1_a0 = _lo(_memd8_const(&a[i]));

        b3_b2 = _hi(_memd8_const(&b[i]));
        b1_b0 = _lo(_memd8_const(&b[i]));

        c3_c2 = _add2(b3_b2, a3_a2);
        c1_c0 = _add2(b1_b0, a1_a0);

        _memd8(&c[i]) = _itod(c3_c2, c1_c0);
    }
}
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6.2.8.2 When to Use Non-Aligned Memory Accesses 


As noted earlier, the C64x can provide 128 bits/cycle bandwidth with aligned 
memory accesses, and 64 bits/cycle bandwidth with non-aligned memory ac- 
cesses. Therefore, it is important to use non-aligned memory accesses in 
places where they provide a true benefit over aligned memory accesses. Gen- 
erally, non-aligned memory accesses are a win in places where they allow a 
routine to be vectorized, where aligned memory accesses could not. These 
places can be broken down into several cases: 


- Generic routines which cannot impose alignment,

- Single-sample algorithms which update their input or output pointers by
  only one sample,

- Nested loop algorithms where the outer loop cannot be unrolled, and

- Routines which have an irregular memory access pattern, or whose access
  pattern is data-dependent and not known until run time.


An example of a generic routine which cannot impose alignment on routines 
that call it would be a library function such as memcpy or strcmp. Single-sam- 
ple algorithms include adaptive filters which preclude processing multiple out- 
puts at once. Nested loop algorithms include 2-D convolution and motion es- 
timation. Data-dependent access algorithms include motion compensation, 
which must read image blocks from arbitrary locations in the source image. 


In each of these cases, it is extremely difficult to transform the problem
into one which uses aligned memory accesses while still vectorizing the code.
Often, the result with aligned memory accesses is worse than if the code were
not optimized for packed data processing at all. So, for these cases,
non-aligned memory accesses are a win.


In contrast, non-aligned memory accesses should not be used in more general
cases where they are not specifically needed. Rather, the program should be
structured to best take advantage of aligned memory accesses with a packed
data processing flow. The following checklist should help.


- Use signed short or unsigned char data types for arrays where possible.
  These are the types for which the ’C64x provides the greatest support.

- Round loop counts, numbers of samples, and so on to multiples of 4 or 8
  where possible. This allows the inner loop to be unrolled more readily to
  take advantage of packed data processing.

- In nested loop algorithms, unroll outer loops to process multiple output
  samples at once. This allows packed data processing techniques to be
  applied to elements that are indexed by the outer loop.
Note:

The default alignment for global arrays is double-word alignment on the
C6400 CPU. Please consult the TMS320C6000 Optimizing C Compiler User’s Guide
for details.


6.2.9 Performing Conditional Operations with Packed Data 


The ’C64x provides a set of operations that are intended to provide conditional 
data flow in code that operates on packed data. These operations make it pos- 
sible to avoid breaking the packed data flow with unpacking code and tradition- 
al ‘if’ statements. 


Common conditional operations, such as maximum, minimum, and absolute value,
are addressed directly with their own specialized intrinsics. In addition to
these specific operations, more generalized compare and select operations can
be constructed using the packed compare intrinsics, _cmpXX2 and _cmpXX4, in
conjunction with the expand intrinsics, _xpnd2 and _xpnd4.


The packed compare intrinsics compare packed data elements, producing a small
bitfield which describes the results of the independent comparisons. For
_cmpeq2, _cmpgt2, and _cmplt2, the intrinsic returns a two-bit field
containing the results of the two separate comparisons. For _cmpeq4,
_cmpgtu4, and _cmpltu4, the intrinsic returns a four-bit field containing the
results of the four separate comparisons. In both sets of intrinsics, a 1 bit
signifies that the tested condition is true, and a 0 signifies that it is
false. Figure 6-21 and Figure 6-22 illustrate how these compare intrinsics
work.


Figure 6-21. Graphical Illustration of _cmpXX2 Intrinsics

[Figure: the _cmpXX2 operation, c = _cmpXX2(a, b).]
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Figure 6-22. Graphical Illustration of _cmpXX4 Intrinsics 
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The expand intrinsics work from a bitfield such as the bitfield returned by the 
compare intrinsics. The _xpnd2 and _xpnd4 intrinsics expand the lower 2 or 
4 bits of a word to fill the entire 32-bit word of the result. The _xpnd2 intrinsic 
expands the lower two bits of the input to two half-words, whereas _xpnd4 ex- 
pands the lower four bits to four bytes. The expanded output is suitable for use 
as a mask, for instance, for selecting values based on the result of a compari- 
son. Figure 6—23 and Figure 6—24 illustrate. 
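The compare-then-expand flow can be modeled in portable C to see how the mask
is formed. This sketch is illustrative only: model_cmpgtu4(), model_xpnd4(),
and clear_below_word() are hypothetical host-side names, and the model
assumes that bit i of the compare result corresponds to byte i of the packed
operands, counting from the least significant byte.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of _cmpgtu4(): one result bit per unsigned byte,
   set where the byte of a is greater than the matching byte of b. */
static uint32_t model_cmpgtu4(uint32_t a, uint32_t b)
{
    uint32_t bits = 0;
    int i;

    for (i = 0; i < 4; i++) {
        uint32_t a_byte = (a >> (8 * i)) & 0xFFu;
        uint32_t b_byte = (b >> (8 * i)) & 0xFFu;

        if (a_byte > b_byte)
            bits |= 1u << i;
    }
    return bits;
}

/* Hypothetical model of _xpnd4(): each of the low four bits becomes a
   byte of all 1s or all 0s. */
static uint32_t model_xpnd4(uint32_t bits)
{
    uint32_t mask = 0;
    int i;

    assert(bits < 16u);     /* only the low four bits are meaningful here */
    for (i = 0; i < 4; i++)
        if (bits & (1u << i))
            mask |= 0xFFu << (8 * i);
    return mask;
}

/* Keep the bytes that exceed the threshold and clear the rest. */
static uint32_t clear_below_word(uint32_t pixels, uint32_t thresh4)
{
    return pixels & model_xpnd4(model_cmpgtu4(pixels, thresh4));
}
```

With pixels = 0x10203040 and a replicated threshold of 0x20202020, the
compare result is 0x3, the expanded mask is 0x0000FFFF, and the masked word
is 0x00003040.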
Example 6-16 shows a kernel that can benefit from the packed compare and
expand intrinsics. The Clear Below Threshold kernel scans an image of 8-bit
unsigned pixels, and sets all pixels that are less than or equal to a given
threshold to 0.


Example 6—16. Clear Below Threshold Kernel 


void clear_below_thresh(unsigned char *restrict image, int count,
                        unsigned char threshold)
{
    int i;

    for (i = 0; i < count; i++)
    {
        if (image[i] <= threshold)
            image[i] = 0;
    }
}


Vectorization techniques are applied to the code (as described in Section 6.2),
giving the result shown in Example 6-17. The _cmpgtu4() intrinsic compares
the pixels against the threshold values, and the _xpnd4() intrinsic generates
a mask for setting pixels to 0. Note that the new code has the restriction
that the input image must be double-word aligned, and must contain a multiple
of 8 pixels. These restrictions are reasonable, as common image sizes have a
multiple of 8 pixels.
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Example 6-17. Clear Below Threshold Kernel, Using _cmpgtu4 and _xpnd4 Intrinsics 


void clear_below_thresh(unsigned char *restrict image, int count,
                        unsigned char threshold)
{
    int i;
    unsigned temp;                       /* Threshold (replicated twice)  */
    unsigned t3_t2_t1_t0;                /* Threshold (replicated)        */
    unsigned p7_p6_p5_p4, p3_p2_p1_p0;   /* Pixels                        */
    unsigned c7_c6_c5_c4, c3_c2_c1_c0;   /* Comparison results            */
    unsigned x7_x6_x5_x4, x3_x2_x1_x0;   /* Expanded masks                */

    /* Replicate the threshold value four times in a single word */
    temp        = _pack2(threshold, threshold);
    t3_t2_t1_t0 = _packl4(temp, temp);

    for (i = 0; i < count; i += 8)
    {
        /* Load 8 pixels from input image (one double-word).          */
        p7_p6_p5_p4 = _hi(_amemd8(&image[i]));
        p3_p2_p1_p0 = _lo(_amemd8(&image[i]));

        /* Compare each of the pixels to the threshold.               */
        c7_c6_c5_c4 = _cmpgtu4(p7_p6_p5_p4, t3_t2_t1_t0);
        c3_c2_c1_c0 = _cmpgtu4(p3_p2_p1_p0, t3_t2_t1_t0);

        /* Expand the comparison results to generate a bitmask.       */
        x7_x6_x5_x4 = _xpnd4(c7_c6_c5_c4);
        x3_x2_x1_x0 = _xpnd4(c3_c2_c1_c0);

        /* Apply mask to the pixels.  Pixels that were less than or   */
        /* equal to the threshold will be forced to 0 because the     */
        /* corresponding mask bits will be all 0s.  The pixels that   */
        /* were greater will not be modified, because their mask      */
        /* bits will be all 1s.                                       */
        p7_p6_p5_p4 = p7_p6_p5_p4 & x7_x6_x5_x4;
        p3_p2_p1_p0 = p3_p2_p1_p0 & x3_x2_x1_x0;

        /* Store the thresholded pixels back to the image.            */
        _amemd8(&image[i]) = _itod(p7_p6_p5_p4, p3_p2_p1_p0);
    }
}
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6.3 Linear Assembly Considerations 


The ’C64x supports linear assembly programming via the C6000 Assembly 
Optimizer. The operation of the Assembly Optimizer is described in detail in 
the Optimizing C/C++ Compiler User’s Guide. This section covers 'C64x spe- 
cific aspects of linear assembly programming. 


6.3.1. Using BDEC and BPOS in Linear Assembly 


The ’C64x provides two new instructions, BDEC and BPOS, which are de- 
signed to reduce codesize in loops, as well as to reduce pressure on predica- 
tion registers. The BDEC instruction combines a decrement, test, and branch 
into a single instruction. BPOS is similar, although it does not decrement the 
register. For both, these steps are performed in the following sequence. 


- Test the loop register to see if it is negative. If it is negative, no
  further action occurs. The branch is not taken and the loop counter is not
  updated.

- If the loop counter was not initially negative, decrement the loop counter
  and write the new value back to the register file. (This step does not
  occur for BPOS.)

- If the loop counter was not initially negative, issue the branch. Code will
  begin executing at the branch’s destination after the branch’s delay slots.
  From linear assembly, the branch appears to occur immediately, since linear
  assembly programming hides delay slots from the programmer.


This sequence of events causes BDEC to behave somewhat differently than 
a separate decrement and predicated branch. First, the decision to branch oc- 
curs before the decrement. Second, the decision to branch is based on wheth- 
er the number is negative, rather than whether the number is zero. Together, 
these effects require the programmer to adjust the loop counter in advance of 
a loop. 


Consider Example 6-18. In this C code, the loop iterates for count iterations, 
adding 1 to iters each iteration. After the loop, iters contains the number of 
times the loop iterated. 
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Example 6—18. Loop Trip Count in C 


int count_loop_iterations(int count)
{
    int iters, i;

    iters = 0;

    for (i = count; i > 0; i--)
        iters++;

    return iters;
}


Without BDEC and BPOS, this loop would be written as shown in 
Example 6-19 below. This example uses branches to test whether the loop 
iterates at all, as well as to perform the loop iteration itself. This loop iterates 
exactly the number of times specified by the argument ’count’. 


Example 6-19. Loop Trip Count in Linear Assembly without BDEC 


        .global _count_loop_iterations
_count_loop_iterations .cproc count
        .reg    i, iters, flag

        ZERO    iters               ; Initialize our return value to 0.

        CMPLT   count, 1, flag
[flag]  B       does_not_iterate    ; Do not iterate if count < 1.

        MV      count, i            ; i = count
loop:   .trip   1                   ; This loop is guaranteed to iterate at
                                    ; least once.

        ADD     iters, 1, iters     ; iters++
        SUB     i, 1, i             ; i--
[i]     B       loop                ; while (i > 0);

does_not_iterate:

        .return iters               ; Return our number of iterations.
        .endproc


Using BDEC , the loop is written similarly. However, the loop counter needs to 
be adjusted, since BDEC terminates the loop after the loop counter becomes 
negative. Example 6-20 illustrates using BDEC to conditionally execute the 
loop, as well as to iterate the loop. In the typical case, the loop count needs 
to be decreased by 2 before the loop. The SUB and BDEC before the loop per- 
form this update to the loop counter. 
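The counter adjustment can be checked with a small C model of the BDEC
semantics listed above. This is a sketch for reasoning about trip counts,
not generated code: model_bdec() and count_loop_iterations_bdec() are
hypothetical names, and the model ignores delay slots, just as linear
assembly does. It also assumes count is greater than INT_MIN so that
count - 1 does not overflow.

```c
#include <assert.h>

/* Model of BDEC: test first; only if the counter is not negative,
   decrement it and report that the branch is taken. */
static int model_bdec(int *counter)
{
    assert(counter != 0);
    if (*counter < 0)
        return 0;           /* no branch, counter left unchanged */
    (*counter)--;
    return 1;               /* branch taken */
}

/* Trip count loop with one BDEC guarding the loop and one closing it. */
static int count_loop_iterations_bdec(int count)
{
    int iters = 0;
    int i = count - 1;              /* SUB before the loop            */

    if (!model_bdec(&i))            /* BDEC before the loop: falls    */
        return iters;               /* through when count < 1         */

    do {
        iters++;
    } while (model_bdec(&i));       /* BDEC at the bottom of the loop */

    return iters;
}
```

The SUB and the first BDEC together lower the counter by 2 before the first
iteration, which is why a count of 1 still produces exactly one pass through
the loop body.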
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Example 6-20. Loop Trip Count Using BDEC 


        .global _count_loop_iterations
_count_loop_iterations .cproc count
        .reg    i, iters

        ZERO    iters               ; Initialize our return value to 0.

        SUB     count, 1, i         ; i = count - 1;
        BDEC    loop, i             ; Do not iterate if count < 1.

does_not_iterate:
        .return iters               ; Loop does not iterate, just return 0.

loop:   .trip   1                   ; This loop is guaranteed to iterate at
                                    ; least once.

        ADD     iters, 1, iters     ; iters++
        BDEC    loop, i             ; while (i-- >= 0);

        .return iters               ; Return our number of iterations.
        .endproc


Another approach to using BDEC is to allow the loop to execute extra itera- 
tions, and then compensate for these iterations after the loop. This is particu- 
larly effective in cases where the cost of the conditional flow before the loop 
is greater than the cost of executing the body of the loop, as in the example 
above. Example 6-21 shows one way to apply this modification. 
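The same kind of C model shows why starting the result at -1 compensates for
the extra iteration. As before, this is only a sketch: model_bdec() and
count_loop_iterations_extra() are hypothetical names, delay slots are
ignored, and count is assumed to be greater than INT_MIN.

```c
#include <assert.h>

/* Model of BDEC: branch and decrement only if the counter is not negative. */
static int model_bdec(int *counter)
{
    assert(counter != 0);
    if (*counter < 0)
        return 0;
    (*counter)--;
    return 1;
}

/* Trip count loop with no guard before the loop.  The body always runs
   once more than needed, so the result starts at -1 to compensate. */
static int count_loop_iterations_extra(int count)
{
    int iters = -1;
    int i = count - 1;

    do {
        iters++;
    } while (model_bdec(&i));

    return iters;
}
```

For count values of 0 or less, the body runs exactly once and the function
returns 0, so the conditional flow before the loop is no longer needed.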


Example 6-21. Loop Trip Count Using BDEC With Extra Loop Iterations


        .global _count_loop_iterations
_count_loop_iterations .cproc count
        .reg    i, iters

        MVK     -1, iters           ; Loop executes exactly 1 extra iteration,
                                    ; so start with the iteration count == -1.

        SUB     count, 1, i         ; Force "count==0" to iterate exactly once.

loop:   .trip   1                   ; This loop is guaranteed to iterate at
                                    ; least once.

        ADD     iters, 1, iters     ; iters++
        BDEC    loop, i             ; while (i-- >= 0);

        .return iters               ; Return our number of iterations.
        .endproc
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6.3.1.1. Function Calls and ADDKPC in Linear Assembly 


The ’C64x provides a new instruction, ADDKPC, which is designed to reduce
codesize when making function calls. This new instruction is not directly
accessible from Linear Assembly. However, Linear Assembly provides the
function call directive, .call, and this directive makes use of ADDKPC. The
.call directive is explained in detail in the TMS320C6000 Optimizing C/C++
Compiler User’s Guide.


Example 6—22 illustrates a simple use of the .call directive. The Assembly Op- 
timizer issues an ADDKPC as part of the function call sequence for this .call, 
as shown in the compiler output in Example 6-23. 


Example 6-22. Using the .call Directive in Linear Assembly 


.data 
hello .-string "Hello World”, 0 


-text 

-global _puts 

-global _main 

.cproc 

.reg pointer 
hello, pointer ; Generate a 32-bit pointer to the 
hello, pointer ; phrase "Hello World”. 
_puts (pointer) ; Print the string ”Hello World”. 


loop ; Keep printing it. 


-endproc 


Example 6-23. Compiler Output Using ADDKPC


;       .call   _puts(pointer)      ; Print the string "Hello World".
        B               _puts
        MVKL    .S1     hello,A4    ; Generate a 32-bit pointer to the
        ADDKPC  .S2     RL0,B3,2
        MVKH    .S1     hello,A4    ; phrase "Hello World".
RL0:    ; CALL OCCURS


6.3.1.2 Using .mptr and .mdep With Linear Assembly on the ’C64x 


The Assembly Optimizer supports the .mptr and .mdep directives on the 
’C64x. These directives allow the programmer to specify the memory access 
pattern for loads and stores, as well as which loads and stores are dependent 
on each other. Section 5.2, Assembly Optimizer Options and Directives, 
describes these directives in detail. This section describes the minor differences 
in the behavior of the .mptr directive on the ’C64x versus other C6000 family 
members. 


Most ’C64x implementations will have a different memory bank structure than 
existing ’C62x implementations in order to support the wider memory accesses 
that the ’C64x provides. Refer to the TMS320C6000 Peripherals Reference 
Guide (SPRU190) for specific information on the part that you are using. 


Additionally, the ’C64x’s non-aligned memory accesses do not cause bank 
conflicts, because no other memory access can execute in parallel with a 
non-aligned memory access. As a result, the .mptr directive has no effect on 
non-aligned load and store instructions. 


6.3.2 Avoiding Cross Path Stalls 


6.3.2.1 Architectural Considerations 


The C6000 CPU components consist of: 

- Two general-purpose register files (A and B) 
- Eight functional units (.L1, .L2, .S1, .S2, .M1, .M2, .D1, and .D2) 
- Two load-from-memory data paths (LD1 and LD2) 
- Two store-to-memory data paths (ST1 and ST2) 
- Two data address paths (DA1 and DA2) 
- Two register file data cross paths (1X and 2X) 
6.3.2.2 Register File Cross Paths 


The functional unit is where the instructions (ADD, MPY, etc.) are executed. 
Each functional unit reads directly from and writes directly to the register file 
within its own data path. That is, the .L1, .S1, .D1, and .M1 units write to 
register file A and the .L2, .S2, .D2, and .M2 units write to register file B. 


The register files are also connected to the opposite-side register file’s 
functional units via the 1X and 2X cross paths. These cross paths allow functional 
units from one data path to access a 32-bit operand from the opposite side’s 
register file. The 1X cross path allows data path A’s functional units to read their 
source from register file B. Similarly, the 2X cross path allows data path B’s 
functional units to read their source from register file A. Figure 6-25 illustrates 
how these register file cross paths work. 


Figure 6-25. C64x Data Cross Paths 


[Figure: register file A (A0-A31), register file B (B0-B31), and data address 
paths DA1 and DA2] 


On the ’C64x, all eight of the functional units have access to the opposite side’s 
register file via a cross path. Only two cross paths, 1X and 2X, exist in the 
C6000 architecture. Therefore, the limit is one source read from each data 
path’s opposite register file per clock cycle, or a total of two cross path source 
reads per clock cycle. The ’C64x pipelines data cross path accesses, allowing 
multiple functional units per side to read the same cross path source 
simultaneously. Thus the cross path operand for one side may be used by up to two 
of the functional units on that side in an execute packet. In the ’C62x/’C67x, 
only one functional unit per data path, per execute packet, can get an operand 
from the opposite register file. 


On the ’C64x, a delay clock cycle is introduced whenever an instruction 
attempts to read a source register via a cross path where that register was 
updated in the previous cycle. This is known as a cross path stall. This stall is 
inserted automatically by the hardware; no NOP instruction is needed. For more 
information, see the TMS320C6000 CPU and Instruction Set Reference 
Guide (SPRU189). This cross path stall does not occur on the ’C62x/’C67x. 
The stall is necessary so that the ’C64x can achieve clock rate 
goals beyond 1 GHz. Note that any code written for the ’C62x/’C67x 
that reads a cross path source register updated in the previous cycle 
incurs a one-cycle stall when running on the ’C64x. The code 
still runs correctly, but it takes an additional clock cycle. 
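

The stall rule can be expressed as a small check over a schedule. The following is a minimal sketch under simplifying assumptions: the struct layout, the register numbering, and the function name are hypothetical, every result is treated as written in the cycle in which its instruction executes, and the input is assumed to be sorted by cycle.

```c
#include <stddef.h>

/* One instruction in a hypothetical, already-scheduled stream.
 * Register numbers are arbitrary small integers chosen by the caller. */
struct xp_insn {
    int cycle;  /* cycle in which the instruction executes (>= 0)  */
    int dst;    /* register updated in that cycle, or -1 if none    */
    int xsrc;   /* register read over a cross path, or -1 if none   */
};

/* Counts the cycles that incur a cross path stall: a cycle stalls when
 * some instruction in it reads, over a cross path, a register that was
 * updated in the immediately preceding cycle. Assumes code[] is sorted
 * by cycle. */
int count_cross_path_stalls(const struct xp_insn *code, size_t n)
{
    int stalls = 0;
    int last_stalled_cycle = -1;

    for (size_t i = 0; i < n; i++) {
        if (code[i].xsrc < 0 || code[i].cycle == last_stalled_cycle) {
            continue;   /* no cross path read, or cycle already counted */
        }
        for (size_t j = 0; j < n; j++) {
            if (code[j].dst == code[i].xsrc &&
                code[j].cycle == code[i].cycle - 1) {
                stalls++;
                last_stalled_cycle = code[i].cycle;
                break;
            }
        }
    }
    return stalls;
}
```

A register written in cycle 1 and read over a cross path in cycle 2 counts as one stall; moving the read to cycle 3 or later counts as none.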


It is possible to avoid the cross path stall by scheduling instructions such that 
a cross path operand is not read until at least one clock cycle after the operand 
has been updated. With appropriate scheduling, the ’C64x can provide one 
cross path operand per data path per clock cycle with no stalls. In many cases, 
the TMS320C6000 Optimizing C Compiler and Assembly Optimizer automatically 
perform this scheduling, as demonstrated in Example 6-24. 


Below is a C implementation of a weighted vector sum. Each value of input 
array a is multiplied by a constant, m, and then is shifted to the right by 15 bits. 
This weighted input is then added to a second input array, b, with the weighted 
sum stored in output array c. 


Example 6-24. Avoiding Cross Path Stalls: Weighted Vector Sum Example 


int w_vec(short a[],short b[], short c[], short m, int n) 


This algorithm requires two loads, a multiply, a shift, an add, and a store. Only 
the .D units on the C6000 architecture are capable of loading/storing values 
from/to memory. Since there are two .D units available, it would appear this 
algorithm would require two cycles to produce one result, considering three .D 
operations are required. Be aware, however, that the input and output arrays 
are short, or 16-bit, values. Both the ’C62x and ’C64x have the ability to 
load/store 32 bits per .D unit. (The ’C64x is able to load/store 64 bits per .D 
unit as well.) By unrolling the loop once, it may be possible to produce two 
16-bit results every two clock cycles. 
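

As a C-level illustration of the algorithm and of unrolling it once, the following sketch gives one possible loop body for w_vec together with an unrolled variant. The loop body is inferred from the description of the algorithm; the return value and the name w_vec_unrolled are assumptions made for this sketch.

```c
/* Weighted vector sum: c[i] = ((m * a[i]) >> 15) + b[i].
 * Returning n is an assumption so that the function has a defined result.
 * Note: >> on a negative int is implementation-defined in C; an
 * arithmetic shift, as on common compilers, is assumed. */
int w_vec(short a[], short b[], short c[], short m, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        c[i] = (short)(((m * a[i]) >> 15) + b[i]);
    }
    return n;
}

/* The same computation unrolled once (hypothetical name).
 * n must be even. Each pass handles one pair of 16-bit values. */
int w_vec_unrolled(short a[], short b[], short c[], short m, int n)
{
    int i;

    for (i = 0; i < n; i += 2) {
        c[i]     = (short)(((m * a[i])     >> 15) + b[i]);
        c[i + 1] = (short)(((m * a[i + 1]) >> 15) + b[i + 1]);
    }
    return n;
}
```

In the unrolled form, each pair a[i], a[i + 1] occupies one 32-bit word in memory, which is what allows a single 32-bit load to fetch both inputs.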


Now, examine further a partitioned linear assembly version of the weighted 
vector sum, where data values are brought in 32-bits at a time. With linear as- 
sembly, it is not necessary to specify registers, functional units or delay slots. 
In partitioned linear assembly, the programmer has the option to specify on 
what side of the machine the instructions will execute. We can further specify 
the functional unit as seen below in Example 6-25. 


Example 6-25. Avoiding Cross Path Stalls: Partitioned Linear Assembly 


        .global _w_vec
_w_vec: .cproc  a, b, c, m
        .reg    ai_i1, bi_i1, pi, pi1, pi_i1, pi_s, pi1_s
        .reg    mask, bi, bi1, ci, ci1, c1, cntr

        MVK     -1, mask          ;
        MVKH    0, mask           ; generate a mask = 0x0000FFFF
        MVK     50, cntr          ; load loop count with 50
        ADD     2, c, c1          ; c1 is offset by 2 (16-bit values) from c

LOOP:   .trip   50                ; this loop will run a minimum of 50 times
        LDW     *a++, ai_i1       ; load 32-bits (an & an+1)
        LDW     *b++, bi_i1       ; load 32-bits (bn & bn+1)
        MPY     ai_i1, m, pi      ; multiply an by a constant ; prod0
        MPYHL   ai_i1, m, pi1     ; multiply an+1 by a constant ; prod1
        SHR     pi, 15, pi_s      ; shift prod0 right by 15 -> sprod0
        SHR     pi1, 15, pi1_s    ; shift prod1 right by 15 -> sprod1
        AND     bi_i1, mask, bi   ; AND bn & bn+1 w/ mask to isolate bn
        SHR     bi_i1, 16, bi1    ; shift bn & bn+1 by 16 to isolate bn+1
        ADD     pi_s, bi, ci      ; add sprod0 + bn
        ADD     pi1_s, bi1, ci1   ; add sprod1 + bn+1
        STH     ci, *c++[2]       ; store 16-bits (cn)
        STH     ci1, *c1++[2]     ; store 16-bits (cn+1)
        SUB     cntr, 1, cntr     ; decrement loop count
 [cntr] B       LOOP              ; branch to loop if loop count > 0

        .endproc


In the implementation above, 16-bit values are brought in two at a time with 
the LDW instruction into a single 32-bit register, ai_i1. Each 16-bit value in 
register ai_i1 is multiplied by the short (16-bit) constant m. Each 32-bit 
product is shifted to the right by 15 bits. The second input array is also 
brought in two 16-bit values at a time into a single 32-bit register, bi_i1. 
bi_i1 is ANDed with a mask that zeros the upper 16 bits of the register to 
create bi (a single 16-bit value). bi_i1 is also shifted to the right by 16 bits 
so that the upper 16-bit input value can be added to the corresponding 
weighted input value. 
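

The mask-and-shift handling of packed 16-bit pairs can be modeled in C. The following is a minimal sketch, not the code the tools generate: the function names pack2 and w_vec_pair are hypothetical, the element with the lower index is assumed to sit in the lower half of the 32-bit word, and the conversions and right shifts rely on two's-complement behavior found on common compilers.

```c
#include <stdint.h>

/* Packs two 16-bit values into one 32-bit word, with lo in bits 15..0
 * and hi in bits 31..16 (an assumed layout for this sketch). */
uint32_t pack2(short lo, short hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Processes one packed word of a and one packed word of b, producing
 * two 16-bit results. Variable names follow the linear assembly. */
void w_vec_pair(uint32_t ai_i1, uint32_t bi_i1, short m,
                short *ci, short *ci1)
{
    const uint32_t mask = 0x0000FFFFu;

    int32_t pi    = (int16_t)(ai_i1 & mask) * m;  /* an   * m -> prod0   */
    int32_t pi1   = (int16_t)(ai_i1 >> 16) * m;   /* an+1 * m -> prod1   */
    int32_t pi_s  = pi >> 15;                     /* sprod0              */
    int32_t pi1_s = pi1 >> 15;                    /* sprod1              */
    int32_t bi    = (int32_t)(bi_i1 & mask);      /* bn, upper bits zero */
    int32_t bi1   = (int32_t)bi_i1 >> 16;         /* bn+1, sign-extended */

    /* Only the low 16 bits of each sum are stored, so the zeroed upper
     * half of bi does not affect the stored result. */
    *ci  = (short)(pi_s + bi);
    *ci1 = (short)(pi1_s + bi1);
}
```

Because the store keeps only 16 bits, masking bn (rather than sign-extending it) gives the same stored value as the straightforward C loop.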


The code above is sent to the assembly optimizer with the following compiler 
options: -o3, -mi, -mt, -k, and -mg. Since a specific C6000 platform was not 
specified, the default is to generate code for the ’C62x. The -o3 option enables 
the highest level of the optimizer. The -mi option creates code with an interrupt 
threshold equal to infinity. In other words, the code is generated on the 
assumption that interrupts will never occur while it runs. The -k option keeps 
the assembly language file, and -mt indicates that the programmer is assuming 
no aliasing. (Aliasing allows multiple pointers to point to the same object.) 
The -mg option allows profiling to occur in the debugger for benchmarking 
purposes. 


Example 6-26 below is the assembly output generated by the assembly 
optimizer for the weighted vector sum loop kernel: 


Example 6-26. Avoiding Cross Path Stalls: Vector Sum Loop Kernel 


LOOP:   ; PIPED LOOP KERNEL

           AND     .L2X  A3,B6,B8          ; AND bn & bn+1 with mask to isolate bn
||         SHR           A0,0xf,A0         ; shift prod0 right by 15 -> sprod0
||         MPY           B2,A5,A0          ; multiply an by constant ; prod0
|| [ A1]   B             LOOP              ; branch to loop if loop count >0
||         ADD           0xffffffff,A1,A1  ; decrement loop count
||         LDW           *A7++,A3          ; load 32-bits (bn & bn+1)
||         LDW           *B5++,B2          ; load 32-bits (an & an+1)


2,A2,A2 7 

||         STH           B1,*B4++(4)       ; store 16-bits (cn+1)
||         STH           A6,*A8++(4)       ; store 16-bits (cn)
||         ADD           A4,B0,A6          ; add sprod1 + bn+1
||         ADD     .L2X  B8,A0,B1          ; add sprod0 + bn
||         SHR           B9,0xf,B0         ; shift prod1 right by 15 -> sprod1
||         SHR           A3,0x10,A4        ; shift bn & bn+1 by 16 to isolate bn+1
||         MPYHL         B2,B7,B9          ; multiply an+1 by a constant ; prod1


This two-cycle loop produces two 16-bit results per loop iteration as planned. 


If the code is used on the ’C64x, be aware that in the first execute packet 
A0 (prod0) is shifted to the right by 15, causing the result to be written back into 
A0. In the next execute packet, and therefore the next clock cycle, A0 (sprod0) 
is used as a cross path operand to the .L2 functional unit. If this code were run 
on the ’C64x, it would exhibit a one-cycle clock stall as described above: A0 
is updated in cycle 2 and used as a cross path operand in cycle 3. As a 
result, the two-cycle loop would take three cycles to execute. 


The cross path stall can, in most cases, be avoided, if the —mv6400 option is 
added to the compiler options list. This option indicates to the compiler/assem- 
bly optimizer that the code below will be run on the ’C64x core. 


Example 6-27 below shows the assembly output generated by the assembly 
optimizer for the weighted vector sum loop kernel compiled with the -mv6400 -o3 
-mt -mi -k -mg options: 


Example 6-27. Avoiding Cross Path Stalls: Assembly Output Generated for Weighted 
Vector Sum Loop Kernel 


LOOP:   ; PIPED LOOP KERNEL

           STH     A6,*A8++(4)   ; store 16-bits (cn)
||         ADD     B9,A16,B9     ; add bn + copy of sprod0
||         MV      A3,A16        ; copy sprod0 to another register
||         SHR     A5,0x10,A3    ; shift bn & bn+1 by 16 to isolate bn+1
||         BDEC    LOOP,B0       ; branch to loop & decrement loop count
||         MPY     B17,A7,A4     ; multiply an by a constant ; prod0
||         MPYHL   B17,B4,B16    ; multiply an+1 by a constant ; prod1
||         LDW     *B6++,B17     ; load 32-bits (an & an+1)

           STH     B9,*B7++(4)   ; store 16-bits (cn+1)
||         ADD     A3,B8,A6      ; add bn+1 + sprod1
||         AND     A5,B5,B9      ; AND bn & bn+1 with mask to isolate bn
||         SHR     B16,0xf,B8    ; shift prod1 right by 15 -> sprod1
||         SHR     A4,0xf,A3     ; shift prod0 right by 15 -> sprod0
||         LDW     *A9++,A5      ; load 32-bits (bn & bn+1)


In Example 6-27, the assembly optimizer has created a two-cycle loop without 
a cross path stall. The loop count decrement instruction and the conditional 
branch to loop based on the value of the loop count have been replaced 
with a single BDEC instruction. In the instruction slot created by combining 
these two instructions into one, a MV instruction has been placed. The MV 
instruction copies the value in the source register to the destination register. The 
value in A3 (sprod0) is placed into A16. A16 is then used as a cross path operand 
to the .L2 functional unit. A16 is updated every two cycles. For example, 
A16 is updated in cycles 2, 4, 6, 8, etc. The value of A16 from the previous loop 
iteration is used as the cross path operand to the .L2 unit in cycles 2, 4, 6, 8, 
etc. This rescheduling prevents the cross path stall. Again, the result is a 
two-cycle loop with two 16-bit results produced per loop iteration. Further 
optimization of this algorithm can be achieved by unrolling the loop one more time. 


Structure of Assembly Code 


An assembly language program must be an ASCII text file. Any line of 
assembly code can include up to seven items: 

- Label 
- Parallel bars 
- Conditions 
- Instruction 
- Functional unit 
- Operands 
- Comment 


7.1 Labels 


A label identifies a line of code or a variable and represents a memory address 
that contains either an instruction or data. 


Figure 7—1 shows the position of the label in a line of assembly code. The colon 
following the label is optional. 


Figure 7-1. Labels in Assembly Code 


label: parallel bars [condition] instruction unit operands ; comments 


Labels must meet the following conditions: 

- The first character of a label must be a letter or an underscore (_) followed 
  by a letter. 
- The first character of the label must be in the first column of the text file. 
- Labels can include up to 32 alphanumeric characters. 


7.2 Parallel Bars 


An instruction that executes in parallel with the previous instruction signifies 
this with parallel bars (||). This field is left blank for an instruction that does not 
execute in parallel with the previous instruction. 


Figure 7-2. Parallel Bars in Assembly Code 


label: parallel bars [condition] instruction unit operands ; comments 


7.3 Conditions 


Five registers on the ’C62x/’C67x are available for conditions: A1, A2, B0, B1, 
and B2. Six registers on the ’C64x are available for conditions: A0, A1, A2, B0, 
B1, and B2. Figure 7-3 shows the position of a condition in a line of assembly 
code. 


Figure 7-3. Conditions in Assembly Code 


label: parallel bars [condition] instruction unit operands ; comments 


All C6000 instructions are conditional: 


- If no condition is specified, the instruction is always performed. 

- If a condition is specified and that condition is true, the instruction 
  executes. For example: 

  With this condition...    The instruction executes if... 
  [A1]                      A1 != 0 
  [!A1]                     A1 = 0 

- If a condition is specified and that condition is false, the instruction does 
  not execute. 

  With this condition...    The instruction does not execute if... 
  [A1]                      A1 = 0 
  [!A1]                     A1 != 0 
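

The rules above amount to a simple predicate on the condition register. The following is a minimal C sketch of that predicate; the enum and function names are hypothetical and exist only for this illustration.

```c
/* How the condition field of an instruction is written. */
enum condition {
    COND_NONE,      /* no condition specified            */
    COND_NONZERO,   /* written as [reg],  for example [A1]  */
    COND_ZERO       /* written as [!reg], for example [!A1] */
};

/* Returns 1 if the instruction executes and 0 if it does not,
 * given the current value of its condition register. */
int instruction_executes(enum condition cond, int reg_value)
{
    switch (cond) {
    case COND_NONZERO:
        return reg_value != 0;
    case COND_ZERO:
        return reg_value == 0;
    default:
        return 1;   /* unconditional instructions always execute */
    }
}
```

For example, an instruction written with [!A1] executes only while A1 holds zero, which is how a loop counter that has reached zero can trigger exit code.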


7.4 Instructions 


Assembly code instructions are either directives or mnemonics: 

- Assembler directives are commands for the assembler (asm6x) that 
  control the assembly process or define the data structures (constants and 
  variables) in the assembly language program. All assembler directives 
  begin with a period, as shown in the partial list in Table 7-1. See the 
  TMS320C6000 Assembly Language Tools User’s Guide for a complete 
  list of directives. 

- Processor mnemonics are the actual microprocessor instructions that 
  execute at runtime and perform the operations in the program. Processor 
  mnemonics must begin in column 2 or greater. For more information about 
  processor mnemonics, see the TMS320C6000 CPU and Instruction Set 
  User’s Guide. 


Figure 7—4 shows the position of the instruction in a line of assembly code. 


Figure 7—4. Instructions in Assembly Code 


label: parallel bars [condition] instruction unit operands ; comments 


Table 7-1. Selected TMS320C6x Directives 


Directives        Description 
.sect "name"      Creates section of information (data or code) 
.double value     Reserve two consecutive 32 bits (64 bits) in memory and 
                  fill with double-precision (64-bit) IEEE floating-point 
                  representation of specified value 
.float value      Reserve 32 bits in memory and fill with single-precision 
                  (32-bit) IEEE floating-point representation of specified 
                  value 
.int value        Reserve 32 bits in memory and fill with specified value 
.long value 
.word value 
.short value      Reserve 16 bits in memory and fill with specified value 
.half value 
.byte value       Reserve 8 bits in memory and fill with specified value 


See the TMS320C6000 Assembly Language Tools User’s Guide for a com- 
plete list of directives. 


7.5 Functional Units 


The ’C6000 CPU contains eight functional units, which are shown in 
Figure 7-5 and described in Table 7-2. 


Figure 7-5. TMS320C6x Functional Units 


Table 7-2. Functional Units and Operations Performed 


.L unit (.L1, .L2) 

  Fixed-point operations: 
  - 32/40-bit arithmetic and compare operations 
  - 32-bit logical operations 
  - Leftmost 1 or 0 counting for 32 bits 
  - Normalization count for 32 and 40 bits 
  - Byte shifts 
  - Data packing/unpacking 
  - 5-bit constant generation 
  - Dual 16-bit arithmetic operations 
  - Quad 8-bit arithmetic operations 
  - Dual 16-bit min/max operations 
  - Quad 8-bit min/max operations 

  Floating-point operations: 
  - Arithmetic operations 
  - DP -> SP, INT -> DP, INT -> SP conversion operations 


.S unit (.S1, .S2) 

  Fixed-point operations: 
  - 32-bit arithmetic operations 
  - 32/40-bit shifts and 32-bit bit-field operations 
  - 32-bit logical operations 
  - Branches 
  - Constant generation 
  - Register transfers to/from control register file (.S2 only) 
  - Byte shifts 
  - Data packing/unpacking 
  - Dual 16-bit compare operations 
  - Quad 8-bit compare operations 
  - Dual 16-bit shift operations 
  - Dual 16-bit saturated arithmetic operations 
  - Quad 8-bit saturated arithmetic operations 

  Floating-point operations: 
  - Compare 
  - Reciprocal and reciprocal square-root operations 
  - Absolute value operations 
  - SP -> DP conversion operations 


.M unit (.M1, .M2) 

  Fixed-point operations: 
  - 16 x 16 multiply operations 
  - 16 x 32 multiply operations 
  - Quad 8 x 8 multiply operations 
  - Dual 16 x 16 multiply operations 
  - Dual 16 x 16 multiply with add/subtract operations 
  - Quad 8 x 8 multiply with add operation 
  - Bit expansion 
  - Bit interleaving/de-interleaving 
  - Variable shift operations 
  - Rotation 
  - Galois Field Multiply 

  Floating-point operations: 
  - 32 x 32-bit fixed-point multiply operations 
  - Floating-point multiply operations 


.D unit (.D1, .D2) 

  Fixed-point operations: 
  - 32-bit add, subtract, linear and circular address calculation 
  - Loads and stores with 5-bit constant offset 
  - Loads and stores with 15-bit constant offset (.D2 only) 
  - Dual 16-bit arithmetic operations 
  - Load and store double words with 5-bit constant 
  - Load and store non-aligned words and double words 
  - 5-bit constant generation 
  - 32-bit logical operations 

  Floating-point operations: 
  - Load doubleword with 5-bit constant offset 


Note: Fixed-point operations are available on all three devices. Floating-point 
operations and 32 x 32-bit fixed-point multiply are available only on the ’C67x. 
Additional ’C64x functions are shown in bold. 

Figure 7-6 shows the position of the unit in a line of assembly code. 


Figure 7-6. Units in the Assembly Code 


label: parallel bars [condition] instruction unit operands ; comments 


Specifying the functional unit in the assembly code is optional. The functional 
unit can be used to document which resource(s) each instruction uses. 


7.6 Operands 


The ’C6000 architecture requires that memory reads and writes move data 
between memory and a register. Figure 7-7 shows the position of the oper- 
ands in a line of assembly code. 


Figure 7-7. Operands in the Assembly Code 


label: parallel bars [condition] instruction unit operands  ; comments 


Instructions have the following requirements for operands in the assembly 
code: 


- All instructions require a destination operand. 
- Most instructions require one or two source operands. 
- The destination operand must be in the same register file as one source 
  operand. 
- One source operand from each register file per execute packet can come 
  from the register file opposite that of the other source operand. 


When an operand comes from the other register file, the unit includes an X, 
as shown in Figure 7-8, indicating that the instruction is using one of the 
cross paths. (See the TMS320C6000 CPU and Instruction Set Reference 
Guide for more information on cross paths.) 


Figure 7-8. Operands in Instructions 


ADD     .L1     A0,A1,A3 

ADD     .L1X    A0,B1,A3 


In the second instruction, all registers except B1 are on the same side of the 
CPU as the .L1 unit, so B1 is read over a cross path. 


The ’C6000 instructions use three types of operands to access data: 

- Register operands indicate a register that contains the data. 
- Constant operands specify the data within the assembly code. 
- Pointer operands contain addresses of data values. 


Only the load and store instructions require and use pointer operands to 
move data values between memory and a register. 


7.7 Comments 


As with all programming languages, comments provide code documentation. 
Figure 7-9 shows the position of the comment in a line of assembly code. 


Figure 7-9. Comments in Assembly Code 


label: parallel bars [condition] instruction unit operands ; comments 


The following are guidelines for using comments in assembly code: 


- A comment may begin in any column when preceded by a semicolon (;). 
- A comment must begin in the first column when preceded by an asterisk (*). 
- Comments are not required but are recommended. 


Chapter 8 


Interrupts 


This chapter describes interrupts from a software-programming point of view. 
A description of single and multiple register assignment is included, followed 
by code generation of interruptible code and finally, descriptions of interrupt 
subroutines. 


8.1 Overview of Interrupts 


An interrupt is an event that stops the current process in the CPU so that the 
CPU can attend to the task needing completion because of another event. 
These events are external to the core CPU but may originate on-chip or off- 
chip. Examples of on-chip interrupt sources include timers, serial ports, DMAs 
and external memory stalls. Examples of off-chip interrupt sources include 
analog-to-digital converters, host controllers and other peripheral devices. 


Typically, DSPs compute different algorithms very quickly within an asynchro- 
nous system environment. Asynchronous systems must be able to control the 
DSP based on events outside of the DSP core. Because certain events can 
have higher priority than algorithms already executing on the DSP, it is some- 
times necessary to change, or interrupt, the task currently executing on the 
DSP. 


The ’C6000 provides hardware interrupts that allow this to occur automatically. 
Once an interrupt is taken, an interrupt subroutine performs certain tasks or 
actions, as required by the event. Servicing an interrupt involves switching 
contexts while saving all state of the machine. Thus, upon return from the inter- 
rupt, operation of the interrupted algorithm is resumed as if there had been no 
interrupt. Saving state involves saving various registers upon entry to the inter- 
rupt subroutine and then restoring them to their original state upon exit. 


This chapter focuses on the software issues associated with interrupts. The 
hardware description of interrupt operation is fully described in the 
TMS320C6000 CPU and Instruction Set Reference Guide. 


In order to understand the software issues of interrupts, we must talk about two 
types of code: the code that is interrupted and the interrupt subroutine, which 
performs the tasks required by the interrupt. The following sections provide in- 
formation on: 


- Single and multiple assignment of registers 
- Loop interruptibility 
- How to use the ’C6000 code generation tools to satisfy different 
  requirements 
- Interrupt subroutines 


8.2 Single Assignment vs. Multiple Assignment 


Register allocation on the C6000 can be classified as either single assignment 
or multiple assignment. Single assignment code is interruptible; multiple as- 
signment is not interruptible. This section discusses the differences between 
each and explains why only single assignment is interruptible. 


Example 8—1 shows multiple assignment code. The term multiple assignment 
means that a particular register has been assigned with more than one value 
(in this case 2 values). On cycle 4, at the beginning of the ADD instruction, reg- 
ister A1 is assigned to two different values. One value, written by the SUB in- 
struction on cycle 1, already resides in the register. The second value is called 
an in-flight value and is assigned by the LDW instruction on cycle 2. Because 
the LDW instruction does not actually write a value into register A1 until the end 
of cycle 6, the assignment is considered in-flight. 


In-flight operations cause code to be uninterruptible due to unpredictability. 
Take, for example, the case where an interrupt is taken on cycle 3. At this point, 
all instructions which have begun execution are allowed to complete and no 
new instructions execute. So, 3 cycles after the interrupt is taken on cycle 3, 
the LDW instruction writes to A1. After the interrupt service routine has been 
processed, program execution continues on cycle 4 with the ADD instruction. 
In this case, the ADD reads register A1 and will be reading the result of the 
LDW, whereas normally the result of the SUB should be read. This 
unpredictability means that, in order to ensure correct operation, multiple 
assignment code should not be interrupted and is thus considered uninterruptible. 


Example 8-1. Code With Multiple Assignment of A1 


cycle 
1     SUB   .S1   A4,A5,A1   ; writes to A1 in single cycle 
2     LDW   .D1   *A0,A1     ; writes to A1 after 4 delay slots 
3     NOP 
4     ADD   .L1   A1,A2,A3   ; uses old A1 (result of SUB) 
5-6   NOP   2 
7     MPY   .M1   A1,A4,A5   ; uses new A1 (result of LDW) 


Example 8-2 shows the same code with a new register allocation to produce 
single assignment code. Now the LDW assigns a value to register A6 instead 
of A1. Now, regardless of whether an interrupt is taken or not, A1 maintains 
the value written by the SUB instruction because LDW now writes to A6. Be- 
cause there are no in-flight registers that are read before an in-flight instruction 
completes, this code is interruptible. 


Example 8-2. Code Using Single Assignment 


cycle 
1     SUB   .S1   A4,A5,A1   ; writes to A1 in single cycle 
2     LDW   .D1   *A0,A6     ; writes to A6 after 4 delay slots 
3     NOP 
4     ADD   .L1   A1,A2,A3   ; uses A1 (result of SUB) 
5-6   NOP   2 
7     MPY   .M1   A6,A4,A5   ; uses A6 (result of LDW) 


Both examples involve exactly the same schedule of instructions. The only dif- 
ference is the register allocation. The single assignment register allocation, as 
shown in Example 8-2, can result in higher register pressure (Example 8-2 
uses one more register than Example 8-1). 
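

The effect of an interrupt on the two register allocations can be reproduced with a small simulation. The following is a minimal sketch under stated assumptions, not a model of the real pipeline: all type and function names are hypothetical, every LDW returns the same caller-supplied value, and an interrupt simply lets all in-flight results land before execution resumes.

```c
enum sim_op { SIM_SUB, SIM_LDW, SIM_ADD, SIM_MPY };

struct sim_insn {
    int cycle;          /* issue cycle, 1-based                      */
    enum sim_op op;
    int src1, src2;     /* A-register indices (ignored for SIM_LDW)  */
    int dst;            /* A-register index written                  */
    int delay;          /* delay slots before the result is written  */
};

/* Schedule of Example 8-1: the SUB and the LDW both target A1. */
const struct sim_insn multi_assign[4] = {
    { 1, SIM_SUB, 4, 5, 1, 0 },
    { 2, SIM_LDW, 0, 0, 1, 4 },
    { 4, SIM_ADD, 1, 2, 3, 0 },
    { 7, SIM_MPY, 1, 4, 5, 1 },
};

/* Schedule of Example 8-2: the LDW targets A6 instead. */
const struct sim_insn single_assign[4] = {
    { 1, SIM_SUB, 4, 5, 1, 0 },
    { 2, SIM_LDW, 0, 0, 6, 4 },
    { 4, SIM_ADD, 1, 2, 3, 0 },
    { 7, SIM_MPY, 6, 4, 5, 1 },
};

#define SIM_MAX 8   /* limit on instructions and on the drain horizon */

/* Runs n (<= SIM_MAX) instructions on regs[] (A0..A7). loaded is the
 * value every LDW returns. If irq_after > 0, an interrupt is taken
 * after that cycle: results in flight are written before resuming. */
void run_schedule(const struct sim_insn *code, int n, int irq_after,
                  int regs[8], int loaded)
{
    int value[SIM_MAX], ready[SIM_MAX], live[SIM_MAX] = { 0 };
    int last = code[n - 1].cycle + code[n - 1].delay;

    for (int cycle = 1; cycle <= last; cycle++) {
        /* Issue: operands are read at the start of the cycle. */
        for (int k = 0; k < n; k++) {
            if (code[k].cycle != cycle) {
                continue;
            }
            int a = regs[code[k].src1], b = regs[code[k].src2];
            value[k] = code[k].op == SIM_SUB ? a - b
                     : code[k].op == SIM_ADD ? a + b
                     : code[k].op == SIM_MPY ? a * b
                     : loaded;
            ready[k] = cycle + code[k].delay;
            live[k] = 1;
        }
        /* Retire results due at the end of this cycle; on an interrupt,
         * also drain everything in flight, earliest deadline first. */
        int until = (cycle == irq_after) ? cycle + SIM_MAX : cycle;
        for (int c = cycle; c <= until; c++) {
            for (int k = 0; k < n; k++) {
                if (live[k] && ready[k] == c) {
                    regs[code[k].dst] = value[k];
                    live[k] = 0;
                }
            }
        }
    }
}
```

With A4 = 10, A5 = 3, A2 = 100, and a loaded value of 50, the multiple assignment schedule leaves A3 = 107 when uninterrupted but A3 = 150 when interrupted after cycle 3; the single assignment schedule leaves A3 = 107 in both cases.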


The next section describes how to generate interruptible and non-interruptible 
code with the C6000 code generation tools. 


8.3 Interruptible Loops 


Even if code employs single assignment, it may not be interruptible in a loop. 
Because the delay slots of all branch operations are protected from interrupts 
in hardware, all interrupts remain pending as long as the CPU has a pending 
branch. Since the branch instruction on the ’C6000 has 5 delay slots, loops 
smaller than 6 cycles always have a pending branch. For this reason, all loops 
smaller than 6 cycles are uninterruptible. 


There are two options for making a loop with an iteration interval less than 6 
interruptible. 


1) Simply slow down the loop and force an iteration interval of 6 cycles. This 
is not always desirable since there will be a performance degradation. 


2) Unroll the loop until an iteration interval of 6 or greater is achieved. This 
ensures at least the same performance level and in some cases can im- 
prove performance (see section 5.9, Loop Unrolling and section 8.4.4, 
Getting the Most Performance Out of Interruptible Code). The disadvan- 
tage is that code size increases. 
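

For the second option, the unroll factor needed to reach a 6-cycle iteration interval can be estimated with a ceiling division. The following is a minimal sketch; the function name is hypothetical, and it assumes that unrolling by a factor k yields an iteration interval of k times the original, which the scheduler is not guaranteed to achieve.

```c
/* Smallest unroll factor that brings a loop with iteration interval ii
 * (in cycles, ii >= 1) up to the 6-cycle minimum for interruptibility. */
int min_unroll_factor(int ii)
{
    const int min_interruptible_ii = 6;  /* branch cycle + 5 delay slots */

    return (min_interruptible_ii + ii - 1) / ii;   /* ceiling division */
}
```

For example, a 2-cycle loop would need to be unrolled 3 times, while a loop that already has an iteration interval of 6 or more needs no unrolling (factor 1).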


The next section describes how to automatically generate these different 
options with the ’C6000 code generation tools. 


8.4 Interruptible Code Generation 


The ’C6000 code generation tools provide a large degree of flexibility for 
interruptibility. Various combinations of single and multiple assignment code can 
be generated automatically to provide the best tradeoff in interruptibility and 
performance for each part of an application. In most cases, code performance 
is not affected by interruptibility, but there are some exceptions: 

- Software pipelined loops that have high register pressure can fail to 
  allocate registers at a given iteration interval when single assignment is 
  required, but might otherwise succeed to allocate if multiple assignment 
  were allowed. This can result in a larger iteration interval for single 
  assignment software pipelined loops and thus lower performance. To determine 
  if this is a problem for looped code, use the -mw feedback option. If you 
  see a “Cannot allocate machine registers” message after the message 
  about searching for a software pipeline schedule, then you have a register 
  pressure problem. 

- Because loops with minimum iteration intervals less than 6 are not 
  interruptible, higher iteration intervals might be used, which results in lower 
  performance. Unrolling the loop, however, prevents this reduction in 
  performance. (See section 8.4.4.) 

- Higher register pressure in single assignment can cause data spilling to 
  memory in both looped code and non-looped code when there are not 
  enough registers to store all temporary values. This reduces performance 
  but occurs rarely and only in extreme cases. 


The tools provide 3 levels of control to the user. These levels are described in 
the following sections. For a full description of interruptible code generation, 
see the TMS320C6000 Optimizing C/C++ Compiler User’s Guide. 


8.4.1 Level 0 - Specified Code is Guaranteed to Not Be Interrupted 


At this level, the compiler does not disable interrupts. Thus, it is up to you to 
guarantee that no interrupts occur. This level has the advantage that the com- 
piler is allowed to use multiple assignment code and generate the minimum 
iteration intervals for software pipelined loops. 


The command line option -mi (no value specified) can be used for an entire 
module and the following pragma can be used to force this level on a particular 
function: 


#pragma FUNC_INTERRUPT_THRESHOLD (func, uint_max); 


8.4.2 Level 1 — Specified Code Interruptible at All Times 


At this level, the compiler employs single assignment everywhere and never 
produces a loop of less than 6 cycles. The command line option -mi1 can be 
used for an entire module and the following pragma can be used to force this 
level on a particular function: 


#pragma FUNC_INTERRUPT_THRESHOLD (func, 1); 


8.4.3 Level 2 — Specified Code Interruptible Within Threshold Cycles 


The compiler will disable interrupts around loops if the specified threshold 
number is not exceeded. In other words, the user can specify a threshold, or 
maximum interrupt delay, that allows the compiler to use multiple assignment 
in loops that do not exceed this threshold. The code outside of loops can have 
interrupts disabled and also use multiple assignment as long as the threshold 
of uninterruptible cycles is not exceeded. If the compiler cannot determine the 
loop count of a loop, then it assumes the threshold is exceeded and will gener- 
ate an interruptible loop. 


The command line option -mi (threshold) can be used for an entire module and
the following pragma can be used to specify a threshold for a particular
function:


#pragma FUNC_INTERRUPT_THRESHOLD (func, threshold); 
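
As a rough illustration of how the per-function pragma might be combined in one source file, the following sketch applies the Level 1 and Level 2 forms shown above to two functions. The function names, bodies, and the threshold of 100 are hypothetical and chosen only for this example. The pragmas are guarded by the _TMS320C6X macro, which the C6000 compiler predefines, so the same file also builds on a host compiler, where the pragmas have no effect.

```c
/* Hypothetical example functions; only the pragma forms come from the text. */

#ifdef _TMS320C6X
/* Level 1: this function must be interruptible at all times. */
#pragma FUNC_INTERRUPT_THRESHOLD (sum_samples, 1);
/* Level 2: interrupts may be held off for at most 100 cycles (assumed value). */
#pragma FUNC_INTERRUPT_THRESHOLD (sum_squares, 100);
#endif

int sum_samples(const short *x, int n)
{
    int i, sum = 0;

    for (i = 0; i < n; i++)
        sum += x[i];

    return sum;
}

int sum_squares(const short *x, int n)
{
    int i, sum = 0;

    for (i = 0; i < n; i++)
        sum += x[i] * x[i];

    return sum;
}
```

The pragmas change only how the compiler schedules each function; the results the functions compute are the same at every level.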
8.4.4 Getting the Most Performance Out of Interruptible Code


As stated in Chapter 4 and Chapter 7, the .trip directive and the MUST_ITERATE
pragma can be used to specify a maximum value for the trip count of a
loop. This information can help to prevent performance loss when your loops
need to be interruptible, as in Example 8-3.


For example, if your application has an interrupt threshold of 100 cycles, you 
will use the -mi100 option when compiling your application. Assume that there 
is a dot product routine in your application as follows: 


Example 8-3. Dot Product With MUST_ITERATE Pragma Guaranteeing Minimum Trip Count

    int dot_prod(short *a, short *b, int n)
    {
        int i, sum = 0;

        #pragma MUST_ITERATE (20);
        for (i = 0; i < n; i++)
            sum += a[i] * b[i];

        return sum;
    }


With the MUST_ITERATE pragma, the compiler only knows that this loop will 
execute at least 20 times. Even with the interrupt threshold set at 100 by the 
-mi option, the compiler will still produce a 6-cycle loop for this code (with only 
one result computed during those six cycles) because the compiler has to ex- 
pect that a value of greater than 100 may be passed into n. 


After looking at the application, you discover that n will never be passed a value
greater than 50 in the dot product routine. Example 8-4 adds this information
to the MUST_ITERATE pragma as follows:


Example 8-4. Dot Product With MUST_ITERATE Pragma Guaranteeing Trip Count Range

    int dot_prod(short *a, short *b, int n)
    {
        int i, sum = 0;

        #pragma MUST_ITERATE (20,50);
        for (i = 0; i < n; i++)
            sum += a[i] * b[i];

        return sum;
    }


Now the compiler knows that the loop will complete in less than 100 cycles 
when it generates a 1-cycle kernel that must execute 50 times (which equals 
50 cycles). The total cycle count of the loop is now known to be less than the 
interrupt threshold, so the compiler will generate the optimal 1-cycle kernel 
loop. You can do the same thing in linear assembly code by specifying both 
the minimum and maximum trip counts with the .trip directive. 
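
The arithmetic behind this decision can be sketched in a few lines of C. This is only a simplified model under stated assumptions, not the compiler's actual algorithm: the function name loop_fits_threshold is hypothetical, and the estimate ignores the loop prolog, the epilog, and all memory stalls.

```c
/* Returns 1 when a loop whose kernel takes ii_cycles per iteration and that
 * runs at most max_trip times is estimated to finish within threshold
 * cycles, and 0 otherwise. Prolog, epilog, and stalls are ignored here. */
int loop_fits_threshold(unsigned int ii_cycles, unsigned int max_trip,
                        unsigned int threshold)
{
    unsigned long long total = (unsigned long long)ii_cycles * max_trip;

    return total <= threshold;
}
```

With a 1-cycle kernel and a maximum trip count of 50, the estimate is 50 cycles, which is within a threshold of 100. When no maximum trip count is known, no such bound can be computed, which is why the compiler falls back to an interruptible loop.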


Note:

The compiler does not take stalls (memory bank conflict, external memory
access time, cache miss, etc.) into account. Because of this, it is
recommended that you be conservative with the threshold value.


Let us now assume the worst case scenario: the application needs to be
interruptible at any given cycle. In this case, you will build your application with an
interrupt threshold of one. It is still possible to regain some performance lost
from setting the interrupt threshold to one. Example 8-5 shows where the
factor option in .trip and the third argument of the MUST_ITERATE pragma
are useful. For more information, see section 2.5.3.4, Loop Unrolling.


Example 8-5. Dot Product With MUST_ITERATE Pragma Guaranteeing Trip Count Range
and Factor of 2

    int dot_prod(short *a, short *b, int n)
    {
        int i, sum = 0;

        #pragma MUST_ITERATE (20,50,2);
        for (i = 0; i < n; i++)
            sum += a[i] * b[i];

        return sum;
    }


By enabling unrolling, performance has doubled from one result per 6-cycle
kernel to two results per 6-cycle kernel. By allowing the compiler to maximize
unrolling when using the interrupt threshold of one, you can get most of the
performance back. Example 8-6 shows a dot product loop whose trip count is
a multiple of 4 between 16 and 48.


Example 8-6. Dot Product With MUST_ITERATE Pragma Guaranteeing Trip Count Range
and Factor of 4

    int dot_prod(short *a, short *b, int n)
    {
        int i, sum = 0;

        #pragma MUST_ITERATE (16,48,4);
        for (i = 0; i < n; i++)
            sum += a[i] * b[i];

        return sum;
    }


The compiler knows that the trip count is some multiple of four. The compiler will
unroll this loop such that four iterations of the loop (four results are calculated)
occur during the six cycle loop kernel. This is an improvement of four times
over the first attempt at building the code with an interrupt threshold of one. The
one drawback of unrolling the code is that code size increases, so this
type of optimization should only be used on key loops.
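
To make the effect of unrolling by four more concrete, the following sketch writes the unrolled loop by hand in C. It is an illustration only: the function name dot_prod_unrolled4 and the four partial sums are assumptions of this example, and the code the compiler actually generates for Example 8-6 will differ. The sketch relies on the same guarantee the pragma gives, namely that n is a multiple of four.

```c
/* Hand-unrolled dot product: four products per pass through the loop body.
 * n is assumed to be a multiple of 4, as MUST_ITERATE (16,48,4) promises. */
int dot_prod_unrolled4(const short *a, const short *b, int n)
{
    int i;
    int sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;

    for (i = 0; i < n; i += 4) {
        sum0 += a[i]     * b[i];
        sum1 += a[i + 1] * b[i + 1];
        sum2 += a[i + 2] * b[i + 2];
        sum3 += a[i + 3] * b[i + 3];
    }

    /* Combine the independent partial sums once, after the loop. */
    return sum0 + sum1 + sum2 + sum3;
}
```

Keeping four independent partial sums means the four multiply-accumulate chains do not depend on each other, which is what lets them share one loop kernel.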
8.5 Interrupt Subroutines


The interrupt subroutine (ISR) is simply the routine, or function, that is called
by an interrupt. The ’C6000 provides hardware to automatically branch to this
routine when an interrupt is received based on an interrupt service table. (See
the Interrupt Service Table in the TMS320C6000 CPU and Instruction Set
Reference Guide.) Once the branch is complete, execution begins at the first
execute packet of the ISR.


Certain state must be saved upon entry to an ISR in order to ensure program 
accuracy upon return from the interrupt. For this reason, all registers that are 
used by the ISR must be saved to memory, preferably a stack pointed to by 
a general purpose register acting as a stack pointer. Then, upon return, all val- 
ues must be restored. This is all handled automatically by the C/C++ compiler, 
but must be done manually when writing hand-coded assembly. 


8.5.1 ISR with the C/C++ Compiler 


The C/C++ compiler automatically generates ISRs with the keyword interrupt. 
The interrupt function must be declared with no arguments and should return 
void. For example: 


    interrupt void int_handler()
    {
        unsigned int flags;
    }


Alternatively, you can use the interrupt pragma to define a function to be an 
ISR: 


    #pragma INTERRUPT (func);


The result of either case is that the C/C++ compiler automatically creates a 
function that obeys all the requirements for an ISR. These are different from 
the calling convention of a normal C/C++ function in the following ways: 


- All general purpose registers used by the subroutine must be saved to the
  stack. If another function is called from the ISR, then all the registers
  (A0-A15, B0-B15 for ’C62x and ’C67x, and A0-A31, B0-B31 for ’C64x)
  are saved to the stack.


- A B IRP instruction is used to return from the interrupt subroutine instead
  of the B B3 instruction used for standard C/C++ functions.


- A function cannot return a value and thus must be declared void.


See the section on Register Conventions in the TMS320C6000 Optimizing 
C/C++ Compiler User’s Guide for more information on standard function call- 
ing conventions. 
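
As a minimal sketch of an ISR written with the interrupt keyword, the example below counts how often a handler runs. The names timer_isr, read_ticks, and tick_count are hypothetical. Because interrupt is an extension of the C6000 compiler, the sketch defines it away when _TMS320C6X is not defined, so the file can also be compiled and exercised on a host machine.

```c
#ifndef _TMS320C6X
#define interrupt   /* host build only: the keyword is a C6000 extension */
#endif

/* Shared with the main program, so it is declared volatile. */
static volatile unsigned int tick_count = 0;

/* No arguments and no return value, as required for an ISR. */
interrupt void timer_isr(void)
{
    tick_count++;
}

/* Called from ordinary code to observe the ISR's effect. */
unsigned int read_ticks(void)
{
    return tick_count;
}
```

On the target, timer_isr is reached through the interrupt service table and returns with B IRP, so it should not be called directly from C code there. A direct call is only meaningful in a host-side test.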
8.5.2 ISR with Hand-Coded Assembly


When writing an ISR by hand, it is necessary to handle the same tasks the
C/C++ compiler does. So, the following steps must be taken:


- All registers used must be saved to the stack before modification. For this
  reason, it is preferable to maintain one general purpose register to be used
  as a stack pointer in your application. (The C/C++ compiler uses B15.)


- If another C routine is called from the ISR (with an assembly branch
  instruction to the _c_func_name label) then all registers must be saved to
  the stack on entry.


- A B IRP instruction must be used to return from the routine. If this is the
  NMI ISR, a B NRP must be used instead.


- An NOP 4 is required after the last LDW in this case to ensure that B0 is
  restored before returning from the interrupt.


Example 8-7. Hand-Coded Assembly ISR

    * Assume Register B0-B4 & A0 are the only registers used by the
    * ISR and no other functions are called
            STW     B0,*B15--       ; store B0 to stack
            STW     A0,*B15--       ; store A0 to stack
            STW     B1,*B15--       ; store B1 to stack
            STW     B2,*B15--       ; store B2 to stack
            STW     B3,*B15--       ; store B3 to stack
            STW     B4,*B15--       ; store B4 to stack
    * Beginning of ISR code
    *       ...
    * End of ISR code
            LDW     *++B15,B4       ; restore B4
            LDW     *++B15,B3       ; restore B3
            LDW     *++B15,B2       ; restore B2
            LDW     *++B15,B1       ; restore B1
            LDW     *++B15,A0       ; restore A0
    ||      B       IRP             ; return from interrupt
            LDW     *++B15,B0       ; restore B0
            NOP     4               ; allow all multi-cycle instructions
                                    ; to complete before branch is taken




8.5.3 Nested Interrupts 


Sometimes it is desirable to allow higher priority interrupts to interrupt lower
priority ISRs. To allow nested interrupts to occur, you must first save the IRP,
IER, and CSR to a register which is not being used or to some other memory
location (usually the stack). Once these have been saved, you can reenable
the appropriate interrupts. This involves setting the GIE bit again and then making
any necessary modifications to the IER, so that only certain interrupts are
allowed to interrupt the particular ISR. On return from the ISR, the original
values of the IRP, IER, and CSR must be restored.


Example 8-8. Hand-Coded Assembly ISR Allowing Nesting of Interrupts

    * Assume Register B0-B5 & A0 are the only registers used by the
    * ISR and no other functions are called
            STW     B0,*B15--       ; store B0 to stack
            MVC     IRP,B0          ; save IRP
            STW     A0,*B15--       ; store A0 to stack
            MVC     IER,B1          ; save IER
            MVK     mask,A0         ; setup a new IER (if desirable)
            STW     B1,*B15--       ; store B1 to stack
            MVC     A0,IER          ; setup a new IER (if desirable)
            STW     B2,*B15--       ; store B2 to stack
            MVC     CSR,A0          ; read current CSR
            STW     B3,*B15--       ; store B3 to stack
            OR      1,A0,A0         ; set GIE bit field in CSR
            STW     B4,*B15--       ; store B4 to stack
            STW     B5,*B15--       ; store B5 to stack
            MVC     A0,CSR          ; write new CSR with GIE enabled
            STW     B0,*B15--       ; store B0 to stack (contains IRP)
            STW     B1,*B15--       ; store B1 to stack (contains IER)
            STW     A0,*B15--       ; store A0 to stack (original CSR)
    * Beginning of ISR code
    *       ...
    * End of ISR code
            B       restore         ; branch to restore routine
                                    ; disable CSR in delay slots of branch
            MVKL    0FFFEh,A0       ; create mask to disable GIE bit
            MVKLH   0FFFFh,A0
            MVC     CSR,B5          ; read current CSR
            AND     A0,B5,B5        ; AND B5 with mask
            MVC     B5,CSR          ; write new CSR with GIE disabled
    restore                         ; restore routine begins at next line
            LDW     *++B15,A0       ; restore A0 (original CSR)
            LDW     *++B15,B1       ; restore B1 (contains IER)
            LDW     *++B15,B0       ; restore B0 (contains IRP)
            LDW     *++B15,B5       ; restore B5
            LDW     *++B15,B4       ; restore B4
            LDW     *++B15,B3       ; restore B3
            LDW     *++B15,B2       ; restore B2
    ||      MVC     B0,IRP          ; restore original IRP
            B       IRP             ; return from interrupt
            LDW     *++B15,B1       ; restore B1
            MVC     B1,IER          ; restore original IER
            LDW     *++B15,A0       ; restore A0
            LDW     *++B15,B0       ; restore B0
            MVC     A0,CSR          ; restore original CSR
            NOP     4               ; allow all multi-cycle instructions
                                    ; to complete before branch is taken
Linking Issues


This chapter contains useful information about other problems and questions
that might arise while building your projects, including:


- What to do with the relocation value truncated linker and assembler
  messages


- How to save on-chip memory by moving the RTS off-chip


- How to build your application with RTS calls either near or far


- How to change the default RTS data from far to near


9.1 How to Use Linker Error Messages


When you try to call a function which, due to how you linked your application,
is too far away from a call site to be reached with the normal PC-relative branch
instruction, you will see the following linker error message:


>> PC-relative displacement overflow. Located in file.obj, 
section .text, SPC offset 000000bc 

This message means that in the named object file, in that particular section, there
is a PC-relative branch instruction trying to reach a call destination that is too far
away. The SPC offset is the section program counter (SPC) offset within that
section where the branch occurs. For C code, the section name will be .text
(unless a CODE_SECTION pragma is in effect).


You might also see this message in connection with an MVK instruction:


>> relocation value truncated at 0xa4 in section .text,
file file.obj


Or, an MVK can be the source of this message: 


>> Signed 16-bit relocation out of range, value truncated. 
Located in file.obj, section .text, SPC offset 000000a4 


How to Find The Problem 


These messages are similar. The file is file.obj, the section is .text, and the
SPC offset is 0xa4. If this happens to you when you are linking C code, here
is what you do to find the problem:


- Recompile the C source file as you did before but include -s -al in the
  options list:

      cl6x <other options> -s -al file.c

  This will give you C interlisted in the assembly output and create an assembler
  listing file with the extension .lst.


- Edit the resulting .lst file, in this case file.lst.


- Each line in the assembly listing has several fields. For a full description
  of those fields see section 3.10 of the TMS320C6000 Assembly Language
  Tools User’s Guide. The field you are interested in here is the second one,
  the section program counter (SPC) field. Find the line with the same SPC
  field as the SPC offset given in the linker error message. It will look like:

      245 000000bc 0FFFEC10!          B       .S1     _atoi           ; |56|


9.1.1.1 Far Function Calls


In this case, the call to the function atoi is too far away from the location where
this code is linked.


It is possible that use of -s will cause instructions to move around some and
thus the instruction at the given SPC offset is not what you expect. The branch
or MVK nearest to that instruction is the most likely cause. Or, you can rebuild
the whole application with -s -al and relink to see the new SPC offset of the
error.


If you are tracing a problem in a hand-coded assembly file, the process is
similar, but you merely re-assemble with the -l option instead of recompiling.


To fix a branch problem, your choices are: 


- Use the -mr1 option to force the call to atoi, and all other RTS functions,
  to be far.


- Compile with -ml1 or higher to force all calls to be far.


- Rewrite your linker command file (looking at a map file usually helps) so
  that all the calls to atoi are close (within 0x100000 words) to where atoi is
  linked.
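
When reading a map file, it can help to check by hand whether a call site and its destination are within near-call range. The following sketch does that arithmetic under simplifying assumptions: the function name near_call_reaches is hypothetical, addresses are byte addresses with 4-byte words, and the exact reference point the hardware uses for the displacement is ignored, so treat results near the boundary with caution.

```c
#include <stdint.h>

#define NEAR_RANGE_WORDS 0x100000   /* reach of a near call, from the text */

/* Returns 1 when target is within NEAR_RANGE_WORDS words of call_site. */
int near_call_reaches(uint32_t call_site, uint32_t target)
{
    int64_t distance_bytes = (int64_t)target - (int64_t)call_site;
    int64_t distance_words = distance_bytes / 4;   /* one word is 4 bytes */

    if (distance_words < 0)
        distance_words = -distance_words;

    return distance_words < NEAR_RANGE_WORDS;
}
```

For example, a branch at 0x000000bc can reach a function linked at 0x00001000, but not one linked at 0x01400000.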


9.1.1.2 Far Global Data 


If the problem instruction is an MVK, then you need to understand why the 
constant expression does not fit. 


For C code, you might find the instruction looks like: 


    50 000000a4 0200002A%           MVK     (_ary-$bss),B4


In this case, the address of the C object ary is being computed as if ary is
declared near (the default), but because it falls outside of the 15-bit address
range the compiler presumes for near objects, you get the warning. To fix this
problem, you can declare ary to be far, or you can use the correct cl6x -mln
memory model option to automatically declare ary and other such data objects
to be far. See chapter 2 of the TMS320C6000 Optimizing C/C++ Compiler
User’s Guide for more information on -mln.


It is also possible that ary is defined as far in one file and declared as near in
this file. In that case, ensure ary is defined and declared consistently in all files
in the project.


Example 9-1. Referencing Far Global Objects Defined in Other Files

    <file1.c>

    /* Define ary to be a global variable not accessible via the data page */
    /* pointer.                                                            */

    far int ary;

    <file2.c>

    /* In order for the code in file2.c to access ary correctly, it must be
       declared as 'extern far'. 'extern' informs the compiler that ary is
       defined in some other file. 'far' informs the compiler that ary is
       not accessible via the data page pointer. If the 'far' keyword is
       missing, then the compiler will incorrectly assume that ary is in
       .bss and can be accessed via the data page pointer. */

    extern far int ary;


9.1.1.3 The MVKL Mnemonic


If the MVK instruction is just a simple load of an address:

    123 000000a4 0200002A!          MVK     sym,B4

Then the linker warning message is telling you that sym is greater than 32767,
and you will end up with something other than the value of sym in B4. In most
cases, this instruction is accompanied by:

    124 000000a8 0200006A!          MVKH    sym,B4

When this is the case, the solution is to change the MVK to MVKL.


On any other MVK problem, it usually helps to look up the value of the sym- 
bol(s) involved in the linker map file. 
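
The following C sketch models why an MVKL/MVKH pair yields the full 32-bit value even when the symbol does not fit in 16 bits. It is a simplified model written for this explanation, not a description of the assembler's implementation, and the function names are hypothetical. It assumes that the low-half move sign-extends its 16-bit constant and that MVKH replaces only the upper 16 bits of the register.

```c
#include <stdint.h>

/* Model of MVK/MVKL: the low 16 bits of the constant are sign-extended
 * into the 32-bit register. */
uint32_t model_mvkl(uint32_t constant)
{
    uint32_t low = constant & 0xFFFFu;

    if (low & 0x8000u)
        low |= 0xFFFF0000u;

    return low;
}

/* Model of MVKH: the upper 16 bits of the constant replace the upper half
 * of the register; the lower half is kept. */
uint32_t model_mvkh(uint32_t reg, uint32_t constant)
{
    return (constant & 0xFFFF0000u) | (reg & 0x0000FFFFu);
}

/* An MVK alone is only safe when the value fits in a signed 16-bit field. */
int fits_mvk(int32_t value)
{
    return value >= -32768 && value <= 32767;
}

/* MVKL followed by MVKH builds the full 32-bit value. */
uint32_t load_constant32(uint32_t constant)
{
    return model_mvkh(model_mvkl(constant), constant);
}
```

In this model, a value such as 0x00012345 loses its upper bits when moved with the low-half instruction alone, while the pair reproduces it exactly. The pair was always correct; changing MVK to MVKL only tells the tools that the truncation of the low half is intended.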
9.1.2 Executable Flag


You may also see the linker message:

>> warning: output file file.out not executable


If this is due solely to MVK instructions, paired with MVKH, which have yet to 
be changed to MVKL, then this warning may safely be ignored. The loaders 
supplied by TI will still load and execute this .out file. 


If you implement your own loader, please be aware this warning message 
means the F_EXEC flag in the file header is not set. If your loader depends on 
this flag, then you will have to fix your MVK instructions, or use the switches 
described above to turn off these warnings. 
9.2 How to Save On-Chip Memory by Placing RTS Off-Chip


One of many techniques you might use to save valuable on-chip space is to
place the code and data needed by the runtime-support (RTS) functions in
off-chip memory.


Placing the RTS in off-chip memory has the advantage of saving valuable
on-chip space. However, it comes at a cost. The RTS functions will run much
slower. Depending on your application, this may or may not be acceptable. It is also
possible your application doesn’t use the RTS library much, and placing the
RTS off-chip saves very little on-chip memory.


Table 9-1. Definitions

Term                     Means

Normal RTS functions     Ordinary RTS functions. Example: strcpy

Internal RTS functions   Functions which implement atomic C operations such as
                         divide or floating point math on the C62x and C64x.
                         Example: _divu performs 32-bit unsigned divide.

near calls               Function calls performed with an ordinary PC-relative
                         branch instruction. The destination of such branches
                         must be within 1 048 576 (0x100000) words of the
                         branch. Such calls use 1 instruction word and 1 cycle.

far calls                Function calls performed by loading the address of the
                         function into a register and then branching to the
                         address in the register. There is no limit on the range
                         of the call. Such calls use 3 instruction words and
                         3 cycles.


9.2.1 How to Compile 


Make use of shell (cl6x) options for controlling how RTS functions are called: 


Table 9-2. Command Line Options for RTS Calls

Option     Internal RTS calls     Normal RTS calls
Default    Same as user           Same as user
-mr0       Near                   Near
-mr1       Far                    Far


By default, RTS functions are called with the same convention as ordinary
user-coded functions. If you do not use a -mln option to enable one of the
large-memory models, then these calls will be near. The option -mr0 causes calls
to RTS functions to be near, regardless of the setting of the -mln switch. This
option is for special situations, and typically isn’t needed. The option -mr1 will
cause calls to RTS functions to be far, regardless of the setting of the -mln
switch.


Note these options only address how RTS functions are called. Calling func- 
tions with the far method does not mean those functions must be in off-chip 
memory. It simply means those functions can be placed at any distance from 
where they are called. 


9.2.2 Must #include Header Files


When you call an RTS function, you must include the header file which
corresponds to that function. For instance, when you call memcmp, you must
#include <string.h>. If you do not include the header, the memcmp call looks like
a normal user call to the compiler, and the effect of using -mr1 does not occur.


9.2.3 RTS Data


Most RTS functions do not have any data of their own. Data is typically passed
as arguments or through pointers. However, a few functions do have their own
data. All of the "is<xxx>" character recognition functions defined in ctype.h
refer to a global table. Also, many of the floating point math functions have their
own constant look-up tables. All RTS data is defined to be far data, that is,
accessed without regard to where it is in memory. Again, this does not
necessarily mean this data is in off-chip memory.


Details on how to change access of RTS data are given in section 9.2.7.
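
As a small illustration of the two points above, the following sketch includes the matching header for each RTS function it calls. The helper names count_leading_digits and same_prefix are hypothetical. The code is ordinary standard C: memcmp is declared in string.h, and isdigit is one of the ctype.h character recognition functions that refer to the RTS global table.

```c
#include <ctype.h>    /* declares isdigit and the other is<xxx> functions */
#include <string.h>   /* declares memcmp */

/* Counts the decimal digits at the start of s. */
int count_leading_digits(const char *s)
{
    int n = 0;

    while (isdigit((unsigned char)s[n]))
        n++;

    return n;
}

/* Returns 1 when the first len bytes of a and b are identical. */
int same_prefix(const void *a, const void *b, size_t len)
{
    return memcmp(a, b, len) == 0;
}
```

With both headers included, the compiler sees the RTS declarations for memcmp and isdigit, so options such as -mr1 can take effect for these calls.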
9.2.4 How to Link


You place the RTS code and data in off-chip memory through the linking
process. Here is an example linker command file you could use instead of the
lnk.cmd file provided in the lib directory.


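    /*********************************************************************/
    /* farlnk.cmd - Link command file which puts RTS off-chip            */
    /*********************************************************************/
    -c
    -heap  0x2000
    -stack 0x4000

    /* Memory Map 1 - the default */
    MEMORY
    {
        PMEM:   o = 00000000h   l = 00010000h
        EXT0:   o = 00400000h   l = 01000000h
        EXT1:   o = 01400000h   l = 00400000h
        EXT2:   o = 02000000h   l = 01000000h
        EXT3:   o = 03000000h   l = 01000000h
        BMEM:   o = 80000000h   l = 00010000h
    }

    SECTIONS
    {
        /*-----------------------------------------------------------*/
        /* Sections defined only in RTS.                             */
        /*-----------------------------------------------------------*/
        .stack      >   BMEM
        .sysmem     >   BMEM
        .cio        >   EXT0

        /*-----------------------------------------------------------*/
        /* Sections of user code and data                            */
        /*-----------------------------------------------------------*/
        .text       >   PMEM
        .bss        >   BMEM
        .const      >   BMEM
        .data       >   BMEM
        .switch     >   BMEM
        .far        >   EXT2

        /*-----------------------------------------------------------*/
        /* All of .cinit, including from RTS, must be collected      */
        /* together in one step.                                     */
        /*-----------------------------------------------------------*/
        .cinit      >   BMEM

        /*-----------------------------------------------------------*/
        /* RTS code - placed off chip                                */
        /*-----------------------------------------------------------*/
        .rtstext    { -lrts6200.lib(.text) }   >   EXT0

        /*-----------------------------------------------------------*/
        /* RTS data - undefined sections - placed off chip           */
        /*-----------------------------------------------------------*/
        .rtsbss     { -lrts6200.lib(.bss)
                      -lrts6200.lib(.far) }    >   EXT0

        /*-----------------------------------------------------------*/
        /* RTS data - defined sections - placed off chip             */
        /*-----------------------------------------------------------*/
        .rtsdata    { -lrts6200.lib(.const)
                      -lrts6200.lib(.switch) } >   EXT0
    }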


User sections (.text, .bss, .const, .data, .switch, .far) are built and allocated 
normally. 


The .cinit section is built normally as well. It is important to not allocate the RTS 
.cinit sections separately as is done with the other RTS sections. All of the .cinit 
sections must be combined together into one section for auto-initialization of 
global variables to work properly. 


The .stack, .sysmem, and .cio sections are entirely created from within the 
RTS. So, you don’t need any special syntax to build and allocate these sec- 
tions separately from user sections. Typically, you place the .stack (system 
stack) and .sysmem (heap of memory used by malloc, etc.) sections in on-chip 
memory for performance reasons. The .cio section is a buffer used by printf 
and related functions. You can typically afford slower performance of such I/O 
functions, so it is placed in off-chip memory. 


The .rtstext section collects all the .text, or code, sections from RTS and
allocates them to the external memory named EXT0. If needed, replace the library
name rts6200.lib with the library you normally use, perhaps rts6700.lib. The
-l is required, and no space is allowed between the -l and the name of the
library. The choice of EXT0 is arbitrary. Use the memory range which makes the
most sense in your application.


The .rtsbss section combines all of the undefined data sections together.
Undefined sections reserve memory without any initialization of the contents of
that memory. You use .bss and .usect assembler directives to create
undefined data sections.


The .rtsdata section combines all of the defined data sections together.
Defined data sections both reserve and initialize the contents of a section. You
use the .sect assembler directive to create defined sections.


It is necessary to build and allocate the undefined data sections separately 
from the defined data sections. When a defined data section is combined to- 
gether with an undefined data section, the resulting output section is a defined 
data section, and the linker must fill the range of memory corresponding to the 
undefined section with a value, typically the default value of 0. This has the un- 
desirable effect of making your resulting .out file much larger. 


You may get a linker warning like: 


>> farlnk.cmd, line 65: warning: rts6200.lib(.switch) not 
found 

That means none of the RTS functions needed by your application define a
.switch section. Simply delete the corresponding -l entry in the linker
command file to avoid the message. If your application changes such that you later
do include an RTS function with a .switch section, it will be linked next to the
.switch sections from your code. This is fine, except it is taking up that valuable
on-chip memory. So, you may want to check for this situation occasionally by
looking at the linker map file you create with the -m linker option.


Note: Library Listed in Command File and On Command Line

If a library is listed in both a linker command file and as an option on the
command line (including make files), check to see that the library is referenced
similarly.

For example, if you have:

    .rtstext { -lrts6200.lib(.text) } > EXT0

and you build with:

    cl6x <options> <files> -z -l<path>rts6200.lib

you might receive an error message from the linker. In this case, check to see
that either both references contain the full pathname or neither of them does.


9.2.5 Example Compiler Invocation


A typical build could look like:

    cl6x -mr1 <other options> <C files> -z -o app.out -m app.map farlnk.cmd

In this one step you both compile all the C files and link them together. The
C6000 executable image file is named app.out and the linker map file is named
app.map.


Refer to section 4.4.1 to learn about the linker error messages when calls go
beyond the PC relative boundary.


9.2.6 Header File Details 


Look at the file linkage.h in the include directory of the release. Depending on
the value of the _FAR_RTS macro, the macro _CODE_ACCESS is set to force
calls to RTS functions to be either user default, near, or far. The _FAR_RTS
macro is set according to the use of the -mrn switch.


Table 9-3. How _FAR_RTS is Defined in Linkage.h With -mr

Option     Internal RTS calls     Normal RTS calls     _FAR_RTS
Default    Same as user           Same as user         Undefined
-mr0       Near                   Near                 0
-mr1       Far                    Far                  1


The _DATA_ACCESS macro is set to always be far. 
The _IDECL macro determines how inline functions are declared. 


All of the RTS header files which define functions or data include the linkage.h
header file. Functions are modified with _CODE_ACCESS:


    extern _CODE_ACCESS void exit(int _status);


and data is modified with _DATA_ACCESS:


extern _DATA_ACCESS unsigned char _ctypes_[]; 
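
The selection logic described by Table 9-3 could be written roughly as follows. This is a hypothetical reconstruction for illustration only, not the contents of the real linkage.h. It uses the stand-in names SKETCH_CODE_ACCESS and SKETCH_DATA_ACCESS so that it cannot clash with the real macros, and the function and table it declares are placeholders. On a host compiler, where far and near are not keywords, the sketch defines them away.

```c
/* Host compilers do not know the far/near keywords; define them away there. */
#ifndef _TMS320C6X
#define far
#define near
#endif

/* Assumed mapping from _FAR_RTS (set by -mr0 / -mr1) to the call method. */
#if !defined(_FAR_RTS)
#define SKETCH_CODE_ACCESS          /* default: same as user code */
#elif _FAR_RTS == 0
#define SKETCH_CODE_ACCESS near     /* -mr0: near calls */
#else
#define SKETCH_CODE_ACCESS far      /* -mr1: far calls */
#endif

#define SKETCH_DATA_ACCESS far      /* RTS data is always far */

/* Declarations in the style the RTS headers use (placeholder names). */
extern SKETCH_CODE_ACCESS int sketch_rts_func(int index);
extern SKETCH_DATA_ACCESS unsigned char sketch_rts_table[4];

SKETCH_DATA_ACCESS unsigned char sketch_rts_table[4] = { 1, 2, 3, 4 };

SKETCH_CODE_ACCESS int sketch_rts_func(int index)
{
    return sketch_rts_table[index & 3];
}
```

Because every RTS declaration goes through one macro, changing a single definition in the header changes how all RTS functions or data are accessed.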


9.2.7 Changing RTS Data to near 


If for some reason you do not want accesses of RTS data to use the far access 
method, take these steps: 


- Go to the include directory of the release.


- Edit linkage.h, and change the

      #define _DATA_ACCESS far

  macro to

      #define _DATA_ACCESS near

  to force all access of RTS data to use near access, or change it to

      #define _DATA_ACCESS

  if you want RTS data access to use the same method used when accessing
  ordinary user data.


- Copy linkage.h to the lib directory.


- Go to the lib directory.


- Replace the linkage.h entry in the source library:

      ar6x -r rts.src linkage.h


- Delete linkage.h.


- Rename or delete the object library you use when linking.


- Rebuild the object library you use with the library build command listed in
  the readme file for that release.


Note that you will have to perform this process each time you install an update
of the code generation toolset.